DOCUMENT RESUME

ED 191 893    TM 800 517

TITLE         Interpreting Variance Components as Evidence for Reliability and Validity.
PUB DATE      Apr 80
NOTE          57p.; Paper presented at the Annual Meeting of the American Educational Research Association (64th, Boston, MA, April 7-11, 1980).
EDRS PRICE    MF01/PC03 Plus Postage.
DESCRIPTORS   Attribution Theory; Behavioral Sciences; *Error of Measurement; Mathematical Models; *Measurement; Observation; Physical Sciences; *Sampling; *Test Reliability; *Test Theory; *Test Validity
IDENTIFIERS   *Generalizability Theory; *Invariance Principle

ABSTRACT

The reliability and validity of measurement is analyzed by a sampling model based on generalizability theory. A model for the relationship between a measurement procedure and an attribute is developed from an analysis of how measurements are used and interpreted in science. The model provides a basis for analyzing the concept of an error of measurement, the distinction between random errors and systematic errors, the standardization of measurement procedures, the distinction between reliability and validity, and an analysis of convergent and discriminant validity. Within the sampling model, the concepts of reliability, validity, and import arise naturally as necessary requirements for the results of measurement to be meaningful. This model also provides a basis for examining the relationship between measurement and theory, and reviews some general techniques for developing theory and controlling errors of measurement. (Author/CP)







INTERPRETING VARIANCE COMPONENTS AS EVIDENCE FOR
RELIABILITY AND VALIDITY







National League for Nursing






Paper presented at the annual meeting of the American Educational
Research Association

Boston, April, 1980






I. Introduction

The technical quality of behavioral measurements is evaluated in terms of two properties, reliability and validity. Reliability is associated with the precision of measurement, and reflects the degree of consistency among independent observations. Validity is concerned with the "meaning" of measurement, that is, with the interpretation to be given to the observations.

The aim of this paper is to provide an analysis of the measurement of dispositional attributes. A sampling model for the relationship between a measurement procedure and a dispositional attribute is developed from an analysis of how dispositional terms are used and interpreted in science. The model provides a basis for analyzing the concept of an error of measurement, the distinction between random errors and systematic errors, the standardization of measurement procedures, the distinction between reliability and validity, and convergent and discriminant validity. Within this sampling model, the concepts of reliability and validity arise naturally as necessary requirements for the results of measurement to be meaningful.

All reliability indices describe the agreement among repeated measurements on the same individuals. The separate measurements for each individual must differ in some of their conditions of observation, and the different reliability coefficients allow different conditions to vary from one set of observations to another. Although they differ in their definitions of error, the different reliability indices all assume that there is a single undifferentiated source of errors.

Generalizability theory (Cronbach, Gleser, Nanda, and Rajaratnam, 1972) provides a multifaceted analysis of the consistency of measurements by relating the variance in observed scores to the sampling of different kinds of conditions of observation, and estimates the relative impact of the various sources of inconsistency. Therefore, generalizability theory provides a general framework in which to examine the dependability of measurements, and it is this framework which is used throughout this paper.
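As an illustration of what such a multifaceted analysis computes, the following Python sketch estimates variance components for the simplest crossed design, in which every person is observed once under every condition. The function names, the toy scores, and the choice of expected-mean-square estimators for this particular design are assumptions made for illustration; they are not taken from Cronbach et al. (1972), and more complex designs require different estimators.

```python
def variance_components(scores):
    """Estimate variance components for a persons x conditions design.

    scores[p][i] is the single observed score for person p under
    condition i (a fully crossed design; an illustrative assumption).
    """
    n_p = len(scores)
    n_i = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_i)
    person_means = [sum(row) / n_i for row in scores]
    cond_means = [sum(row[i] for row in scores) / n_p for i in range(n_i)]

    # Mean squares from the two-way analysis of variance.
    ms_person = n_i * sum((m - grand) ** 2 for m in person_means) / (n_p - 1)
    ms_cond = n_p * sum((m - grand) ** 2 for m in cond_means) / (n_i - 1)
    ss_resid = sum(
        (scores[p][i] - person_means[p] - cond_means[i] + grand) ** 2
        for p in range(n_p)
        for i in range(n_i)
    )
    ms_resid = ss_resid / ((n_p - 1) * (n_i - 1))

    # Solve the expected-mean-square equations for the components.
    return {
        "person": (ms_person - ms_resid) / n_i,
        "condition": (ms_cond - ms_resid) / n_p,
        "residual": ms_resid,
    }


def generalizability_coefficient(components, n_conditions):
    """Ratio of person variance to person variance plus relative error."""
    relative_error = components["residual"] / n_conditions
    return components["person"] / (components["person"] + relative_error)


# Hypothetical scores: 4 persons, each observed under 3 conditions.
scores = [
    [7, 8, 6],
    [5, 6, 5],
    [9, 9, 8],
    [4, 6, 3],
]
components = variance_components(scores)
coefficient = generalizability_coefficient(components, n_conditions=3)
```

In this sketch the person component plays the role of the variance of interest, while the condition and residual components separate two sources of inconsistency that a single reliability coefficient would lump together.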

Such a general framework does not exist for the validity of measurement. Criterion validity examines the agreement between observed scores and some external criterion, and typically uses correlation coefficients to yield a single numerical estimate of validity. Content validity examines how well the operations employed in a measurement procedure match the characteristic being measured, and the results of a study of content validity are usually stated as qualitative judgments, rather than as a numerical coefficient.

Construct validity is more general than content validity or criterion validity. It emphasizes the legitimacy with which various inferences can be drawn on the basis of observed scores, and allows for a wide range of techniques, corresponding to the range of inferences to be drawn. Construct validity may employ the methods of criterion validity or of content validity, but may also use a variety of other techniques.

Unlike reliability, which is defined in terms of agreement among observed scores, validity involves the interpretation of an observed score as representative of some quantity which is not directly observable. Validity requires the assignment of meaning to observed scores.

In introductory textbooks, validity is often equated with the extent to which an observed score measures "what it is intended to measure." Although this statement is too vague to provide an adequate definition, it does emphasize two important points about validity. First, the looseness of the statement allows for a very wide range of procedures for evaluating the validity of a measurement procedure, and this is consistent with practice. Second, it suggests the existence of a "real" value for an attribute, without specifying what this "real" value represents. It is often implicitly assumed that the "real" value of the attribute exists somewhere, and that validation requires a comparison, direct or indirect, between the "real" value and the observed score. Criterion validity tends to encourage this process of reification by introducing the notion of the criterion, which is easily confused with the "real" value of the attribute.



Reliability involves comparisons among observed scores and is intended to indicate the consistency of measurements. Validity seeks to establish an appropriate interpretation for observed scores. Since a high degree of consistency in measuring the wrong attribute is generally seen as being less useful than a lower degree of consistency in measuring the intended attribute, validity is generally considered to be more important than reliability.

Given the great importance assigned to validity, it is surprising that the evidence for the validity of most behavioral measurements is less adequate than the evidence for their reliability. In many cases, evidence for validity is practically nonexistent. Ebel (1961) has aptly described this dilemma:

    "Validity has long been one of the major deities in the pantheon of the psychometrician. It is universally praised, but the good works done in its name are remarkably few. Test validation, in fact, is widely regarded as the least satisfactory aspect of test development."

This situation has not improved markedly since 1961.

Ebel (1961) also points out that physics, which Campbell (1957) has called "the science of measurement", does not seem to encounter problems of validation. One reason for this difference between the behavioral sciences and the physical sciences is that measurement theory in the behavioral sciences gives more attention to statistical procedures and assumptions than it gives to the analysis of how measurements are used. (Construct validity, as developed by Cronbach and Meehl, 1955, and Cronbach, 1971, is a clear exception to this generalization.) In the physical sciences, this situation is reversed; there, the statistical methods used to evaluate measurement procedures are relatively simple, but these methods are closely related to the practice of measurement and its interpretation.

The next two sections are devoted to a discussion of how attributes and measurements of attributes are interpreted. Several simple examples of physical measurement will be introduced in this discussion. These examples are used because the connection between the interpretation of measurements and the indices used to evaluate the accuracy of these measurements is particularly clear in physics.



Generalizability theory (Cronbach et al., 1972) provides the framework and methodology for this paper, and most of the results derived will be stated in terms of variance components. However, the emphasis throughout the paper is on the issues that can be addressed in generalizability theory, rather than on the statistical models. Except for an occasional remark about the severity of some estimation problems, there is no discussion of the complex estimation issues associated with generalizability theory.

OVERVIEW

The analysis presented in this paper is quite long, and in some ways relatively convoluted. An overview of the main points in the development may therefore provide a useful roadmap.



Section II examines the interpretation given to dispositional attributes. Dispositions can be operationally defined in terms of classes, or universes, of possible observations. The numerical value to be assigned to an attribute is defined as the expected value over this universe, and measurements are interpreted as estimates of this expected value.

Section III analyzes the process of measurement and delineates the assumptions that are implicit in the ordinary interpretation of observed scores. The estimates generated by a measurement procedure are based on samples from the universe; consequently, the model for measurement is a sampling model. Estimates of the expected value over a universe, based on different samples, will not generally be equal, and, in order to maintain consistency in the interpretation of measurements, an explicit theory of errors must be introduced. Therefore, the definition of error is based on the interpretation given to the measurements, and errors of measurement are defined by substantive considerations rather than statistical assumptions.

Section IV outlines the terminology and notation of generalizability theory and introduces a sampling model for validity. The validity of measurements of a dispositional attribute is defined in terms of the accuracy with which the observed scores estimate the expected value for the appropriate universe. Where the observed scores are obtained by drawing simple random samples from the appropriate universe, an index of validity can be obtained directly by estimating a generalizability coefficient. In practice, the sampling model for validity becomes quite complicated when it is modified to take account of the sampling designs actually used in measurement procedures.

Section V examines the effects of the standardization of measurement procedures. Standardized measurements involve two kinds of errors: random errors, which vary from one observation to another, and systematic errors, which are constant for a series of measurements. Random errors are related to reliability, and systematic errors are related to validity; this analysis makes it possible to draw a clear distinction between reliability and validity.



Section VI discusses the relationship between the development of theory and the definition of dispositional attributes. A third property of measurements is introduced by defining the concept of import in terms of all of the inferences that can be drawn from an observed score. This section also reviews some potentially powerful techniques for developing theory and controlling errors of measurement.



Most of the results derived in this paper are based on rather strong sampling assumptions. In particular, the unbiased estimation of variance components, which are used extensively in this paper, requires random sampling assumptions. In most cases of practical interest, these assumptions are likely to be violated. In Section VII, these assumptions, and the robustness of the results derived from these assumptions, are examined. Section VII also presents some concluding comments.



II. The Interpretation of Measurements

Lord and Novick (1968, p. 17) define measurement as "a procedure for the assignment of numbers to specified properties of experimental units in such a way as to characterize and preserve specified relationships in the behavioral domain". In discussing the methodology of physics, Campbell (1957, p. 267) defines measurement as "the process of assigning numbers to represent qualities".

According to Nunnally (1967, p. 2), "Measurement consists of rules for assigning numbers to objects to represent quantities of attributes." If the word "objects" is interpreted broadly to include persons and groups of persons as well as physical objects and systems, Nunnally's definition applies to measurement in both the physical and the behavioral sciences.

Measurement consists of the mapping of objects into real numbers, and establishes a functional relationship between real numbers and the members of some class of objects. Depending on the attribute being considered, the object of measurement may take a variety of forms, including physical objects, persons, pairs of objects or persons, groups, and various complex systems. The rules used to assign the numbers may also vary considerably. However, the process of measurement always involves a mapping of the form:

$$u_o = A(o) \qquad (2.1)$$

where $o$ is an object, $A$ represents the rules used to assign numbers for the attribute, and $u_o$ is the real number assigned to $o$ for the attribute, $A$.

Note that Eq. (2.1) makes a fundamental theoretical commitment, in that it implies that the attribute depends only on the object of measurement and does not depend on any of the conditions that may prevail when the observations are made. For example, the statement that the length of a particular rod is 10 inches can be represented as:

$$L(r) = 10$$

where $L$ represents the procedures used to measure length in inches and $r$ represents the rod. This formulation implies that the length of the rod does not depend, for example, on the location, orientation, or temperature of the rod. The length is also assumed to be independent of the person who carries out the operations represented by $L$.

Eq. (2.1) provides a very general symbolic representation of the process of measurement. However, this definition is not very informative or very useful unless the nature of the objects, $o$, and the functions, $A$, that are involved are well understood. In practice, both the function, $A$, and the object, $o$, may be quite complex.

ATTRIBUTES

A measurable attribute can be viewed as a disposition, or a tendency to react in a certain way to some kind of conditions. Dispositions may be qualitative or quantitative. For a qualitative disposition, the object is said to have the attribute if a specific reaction occurs, and is said not to have the attribute if the specific reaction does not occur. A classic example of a qualitative disposition is the property of being a magnet. The typical test condition would consist of placing a small piece of iron near the object being tested. If the iron tends to move toward the object, the object is said to be a magnet, and if the iron shows no tendency to move toward the object, the object is said to be nonmagnetic.



For a quantitative disposition, a number is assigned to the object on the basis of the strength of the reaction to the test conditions. The magnitude of the attribute of being magnetic, or the strength of a magnet, could be defined by how far it moves a piece of iron.



There are basically two ways in which measurable attributes are introduced into science. In the early stages of any science, attributes are developed by quantifying ordinal relationships. Subsequently, other attributes can be derived from empirical laws.

The process of quantifying ordinal properties is one of explication, the transformation of subjective observations into a relatively well defined measurable attribute. Attributes that are defined in this way will be called basic attributes. It is noticed, for example, that some objects are easier to move than others. It is also noticed that this ordering of objects remains the same regardless of where the objects are located, who attempts to move them, or when they are moved. It is convenient, therefore, to think of "resistance to movement" as a property, or attribute, of the objects, and a large class of solid objects can be rank-ordered in terms of this property. Where such an ordinal property exists for a class of objects, numbers can be assigned to all objects in the class, such that the ordering of the numbers corresponds to the ordering of the objects. This assignment of numbers defines an ordinal scale for the attribute.
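The construction of an ordinal scale from pairwise comparisons can be sketched in Python as follows. The objects, the set of comparison outcomes, and the function names are hypothetical, and the sketch assumes that every pair of distinct objects can be ordered without ties. It also shows that any order-preserving relabeling of the numbers serves equally well, which is what makes the scale ordinal rather than interval.

```python
from functools import cmp_to_key


def ordinal_scale(objects, compare):
    """Assign ranks 1..n so that rank order matches the observed ordering.

    compare(a, b) returns a negative number if a is easier to move than b
    and a positive number if a is harder to move (no ties are assumed).
    """
    ranked = sorted(objects, key=cmp_to_key(compare))
    return {obj: rank for rank, obj in enumerate(ranked, start=1)}


def same_ordering(scale_a, scale_b):
    """True if two numerical assignments order every pair identically."""
    objects = list(scale_a)
    return all(
        (scale_a[x] < scale_a[y]) == (scale_b[x] < scale_b[y])
        for x in objects
        for y in objects
    )


# Hypothetical outcomes: (a, b) means a was found harder to move than b.
harder_than = {("brick", "pebble"), ("anvil", "pebble"), ("anvil", "brick")}


def compare(a, b):
    if (a, b) in harder_than:
        return 1
    if (b, a) in harder_than:
        return -1
    return 0


scale = ordinal_scale(["anvil", "pebble", "brick"], compare)

# Any monotone increasing relabeling is an equally valid ordinal scale.
rescaled = {obj: 10 * rank ** 2 for obj, rank in scale.items()}
equivalent = same_ordering(scale, rescaled)
```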

After some basic attributes are developed, empirical laws expressing relationships among those attributes can be developed, and these laws often involve constants that can also be treated as measurable attributes. For example, the measured length, $l_r$, of a metal rod is found to vary systematically with the temperature, $t_r$, of the rod. If $dt_r$ is a change in the temperature of a rod and $dl_r$ is the corresponding change in length,

$$dl_r = k_r \, dt_r \qquad (2.2)$$

where $k_r$ is a constant, called the coefficient of thermal expansion of the rod. Estimates of $k_r$ are obtained by changing the temperature, $t_r$, of the rod, and measuring this change in temperature and the corresponding change in length. An estimate of $k_r$ is then given by the ratio of the change in length to the change in temperature. The operational definition of $k_r$ depends on the definitions of length and temperature, two basic attributes, and on the empirical law in Eq. (2.2), which states a relationship between the two basic attributes. Therefore the interpretation of the coefficient of thermal expansion as a measurable attribute is derived from the interpretation of two basic attributes, and the law relating these two attributes.
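The estimation of $k_r$ from Eq. (2.2) can be sketched in Python as follows. The function name and the length and temperature readings are hypothetical, and averaging several single-observation estimates is simply one plausible way of combining them, not a procedure prescribed by the text.

```python
def estimate_expansion_constant(length_before, length_after,
                                temp_before, temp_after):
    """Estimate k_r as (change in length) / (change in temperature)."""
    dt = temp_after - temp_before
    if dt == 0:
        raise ValueError("temperature must change to estimate k_r")
    return (length_after - length_before) / dt


# Hypothetical observations on one rod:
# (length before, length after, temperature before, temperature after).
observations = [
    (100.00, 100.12, 20.0, 70.0),
    (100.00, 100.05, 20.0, 40.0),
    (100.05, 100.17, 40.0, 90.0),
]
estimates = [estimate_expansion_constant(*obs) for obs in observations]

# Combine the separate estimates into a single value for the rod.
k_r = sum(estimates) / len(estimates)
```

Each observation yields a slightly different estimate, which anticipates the need for a theory of errors once $k_r$ is interpreted as a property of the rod.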

The value of $k_r$ varies from one rod to another but remains relatively constant from one observation to another on a given rod. It is convenient, therefore, to interpret $k_r$ as a property of the rod by assuming that it depends on the rod but not on the conditions prevailing when the rod is observed:

$$k_r = K(r) \qquad (2.3)$$

The assumption that $k_r$ does not depend on the conditions of observation is a good approximation over a wide class of observations, which is taken to be the universe of generalization. Any observation from this universe of generalization could be used to estimate $k_r$.

The interpretation of numbers as the values of an attribute depends on empirical laws that state that different observations on any pair of objects generally rank-order the objects in the same way. Therefore the results of any set of observations provide information about a much wider class of observations that could have been made. It is this generalization from particular observations to a universe of observations that provides the meaning of an attribute and that makes measurements of the attribute useful.

OPERATIONAL DEFINITIONS

The rules that are used to assign a value to an attribute are usually called operational definitions (Bridgman, 1927). The rules are operational in the sense that they are stated in terms of the operations performed in measuring the attribute. The rules are said to be definitions because they provide most of the meaning of the attribute; that is, they provide a basis for interpreting the numbers assigned as values of the attribute. (Ennis, 1973, Hempel, 1960, and Carnap, 1953, provide analyses of the use of operational definitions in science.)

Operational definitions generally include two kinds of rules, structural rules and selection rules. The structural rules specify the kind of observations that are to be used, and the way in which numbers are to be derived from these observations. Thus, psychologists arrange stimulus situations that are likely to elicit the type of behavior that they wish to study. In the absence of such standardization of some characteristics of the observations, it would be very difficult to provide rules for the assignment of numbers. The structural rules may be more or less elaborate. A rule for measuring the length of a rod, for example, might require that an observer align the zero mark of a tape measure with one end of the rod, and record the number on the tape measure that coincides with the other end of the rod. More detailed rules for measuring length are discussed by Campbell (1957), and others, but all of these rules leave some issues open. For example, what kind of tape measure is to be used? Could a slightly bent metal tape measure be used, and would an observer with astigmatism be acceptable?

Such questions lead to the development of selection rules. The selection rules specify the range of conditions that may be tolerated for the various characteristics of the observations. Some of these characteristics may be fixed; in the example above, one end of the rod must be aligned with the zero mark of the tape measure. Other characteristics are specified in terms of ranges for continuous variables; for example, the temperature is to be between 15°C and 25°C. It is assumed that the characteristics not mentioned in the structural rules need not be controlled at all.
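One way to picture how the rules of an operational definition pick out a class of observations is the following Python sketch. The `Observation` fields, the function names, and the particular readings are hypothetical; only the alignment requirement and the 15°C to 25°C range come from the example in the text.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    zero_mark_aligned: bool   # a characteristic fixed by the rules
    temperature_c: float      # a characteristic restricted to a range
    observer: str             # not mentioned by the rules, so free to vary
    reading_inches: float     # number read off the tape measure


def is_admissible(obs, temp_range=(15.0, 25.0)):
    """Selection rules: decide whether an observation belongs to the class."""
    low, high = temp_range
    return obs.zero_mark_aligned and low <= obs.temperature_c <= high


def measure_length(obs):
    """Structural rule: the length is the reading, if the observation counts."""
    if not is_admissible(obs):
        raise ValueError("observation falls outside the operational definition")
    return obs.reading_inches


# Hypothetical observations of the same rod.
candidates = [
    Observation(True, 20.0, "observer A", 10.0),
    Observation(True, 22.5, "observer B", 10.1),
    Observation(True, 40.0, "observer A", 10.3),   # too warm
    Observation(False, 20.0, "observer B", 9.2),   # zero mark not aligned
]
admissible = [obs for obs in candidates if is_admissible(obs)]
```

The observer is recorded but never tested, which mirrors the assumption that characteristics not mentioned in the rules need not be controlled.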

As the example of length indicates, operational definitions do not specify particular observations; they specify classes of observations. The rules for measuring length that were sketched above could be made more precise and more complete by specifying a particular tape measure, a particular temperature, etc., but it would be impossible to specify all of the characteristics that might influence an observation. Furthermore, it would be self-defeating to make the rules too specific because this would limit the usefulness of the concept of length. Ceteris paribus, scientists prefer to use concepts which are as general as possible.

Operational definitions are designed to achieve this generality of application, while providing a clear specification of the class of observations allowed. The fact that operational definitions specify classes of observations rather than specific observations does not necessarily imply any lack of precision in the definition, since classes can be defined precisely. In practice, however, these classes are always defined somewhat ambiguously. If the lighting in the room and the vision of the observer are discussed at all for measurements of length, the requirement is likely to be that they be "normal" or "within normal limits." Most of the characteristics of an observation are ignored unless there is some reason to believe that a particular characteristic is extreme enough to have a serious effect on the observation. This ambiguity is recognized and tolerated because it makes general laws possible (Toulmin, 1953).



THE OBJECT OF MEASUREMENT

The object, or unit, to which a number is assigned by measurement is called the object of measurement. The number representing the attribute is not assigned to an observation. The operational definition of an attribute specifies a class of observations and no one of these observations has special significance.

A measurement assigns a number to some object of measurement which is involved in, and partially defines, a number of observations. The purpose of measurement is to map objects of measurement into real numbers. The number assigned to each object is intended to represent the magnitude of the attribute for the object of measurement. A particular observation can provide information about different kinds of objects of measurement, and if the measurement is to be interpreted unambiguously, it is necessary to clearly identify the object of measurement. Cardinet, Tourneur, and Allal (1976) have discussed the kinds of units which may serve as objects of measurement.

In a study of anxiety, an observation might consist of the response of a person to some stimulus in a particular context. For such observations, the person is usually taken as the object of measurement. However, for researchers seeking to determine how anxiety-provoking stimuli or contexts are, the objects of measurement would be stimuli or contexts, respectively, instead of the person.

More complicated objects of measurement can also be considered. The differential impact of stimuli on different persons could be investigated by taking person-stimulus pairs as the objects of measurement. A researcher who is investigating the interactions between persons and contexts might take person-context pairs as the objects of measurement.

Although the observation has a single assigned value, this value may be given various interpretations depending on the definition of the object of measurement. The specification of the object of measurement is a conceptual issue, and is not uniquely determined by the nature of the observations that are made. As the examples above illustrate, a single observation can provide information about a variety of objects of measurement. Similarly, many different observations may be used to measure a particular attribute for an object of measurement.
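The point that one set of observations can serve different objects of measurement can be illustrated with a short Python sketch. The anxiety scores, the names of persons and stimuli, and the helper function are all hypothetical; the sketch only shows that the same records yield person values or stimulus values depending on which unit is treated as the object of measurement.

```python
from collections import defaultdict


def mean_score_by(observations, object_key):
    """Average the observed scores over everything except the chosen object."""
    grouped = defaultdict(list)
    for obs in observations:
        grouped[obs[object_key]].append(obs["score"])
    return {obj: sum(vals) / len(vals) for obj, vals in grouped.items()}


# Hypothetical anxiety observations: each is one person's response
# to one stimulus.
observations = [
    {"person": "Ann", "stimulus": "exam", "score": 4},
    {"person": "Ann", "stimulus": "speech", "score": 6},
    {"person": "Bob", "stimulus": "exam", "score": 2},
    {"person": "Bob", "stimulus": "speech", "score": 8},
]

# Persons as objects of measurement: how anxious is each person?
by_person = mean_score_by(observations, "person")

# Stimuli as objects of measurement: how anxiety provoking is each stimulus?
by_stimulus = mean_score_by(observations, "stimulus")
```

In these made-up data the two persons receive the same value while the two stimuli differ sharply, so the interpretation of the scores depends entirely on the choice of object.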

The distinction that is often drawn in psychology between a state and a trait depends on a distinction between different kinds of objects of measurement. If the object of measurement is taken to be a person in a particular context, then the attribute being measured is a state variable, which is assumed to be a function of both the person and the context or time. It is, therefore, expected that the value associated with a state variable will change as the context changes over time. For a trait, however, the object of measurement is the person, and the value of the trait variable is assumed to be independent of time. It is recognized, of course, that the behaviors associated with the trait variable will be exhibited to different degrees in different contexts, but this is true of all dispositional variables. For a trait variable, changes in the observed variable over time are taken as errors of measurement; for a state variable, such differences are accounted for by differences in the value of the state variable.



A similar distinction is made in physics between mass and weight. Mass is defined to be an attribute of a physical object, while weight is defined to depend on the physical object and the location of the object in a gravitational field.

In the physical sciences, the objects of measurement to which attributes are assigned are specified explicitly. In their introductory treatment of mechanics, Corben and Stehl (1960) state the following assumptions:

    A particle is described when its position in space is given and when the values of certain parameters such as mass, electric charge, and magnetic moment are given. By our definition of a particle, these parameters must have constant values because they describe the internal constitution of the particle. If these parameters do vary with time, we are not dealing with a simple particle. The position of a particle may, of course, vary with time. (p. 6)

Therefore the mass, charge, and magnetic moment are to be treated as trait variables, with particles as their objects of measurement. Position, however, is to be treated as a state variable with particle-time combinations as its objects of measurement.

It is sometimes claimed that operationally defined attributes are valid by definition. It is maintained that the operations used to measure the attribute define the attribute, and the results of these operations are, by definition, the values of the attribute. According to this view, no interpretation is to be given to the numbers assigned to objects beyond the fact that they result from the particular set of operations. Therefore there is no inference from the results of the operations to a wider class of observations and no need to check on the accuracy of inference.

However, in practice, the operational definitions of even the most narrowly defined attributes involve classes of observations rather than particular observations. No operational definition that is of any significance in science specifies a particular observer (John Jones), particular equipment (voltmeter #6), and particular time and place. Although restrictions may be placed on the qualifications of observers and on the type of equipment used, these restrictions define classes of observations rather than particular observations. If the results of a particular observation could not be used to draw inferences about similar observations, these results would be of little interest.

Unlike observations, attributes are not unique to a particular combination of time, place, observer, etc. Attributes have a generality that is not associated with observations, and to assign a value to an attribute is to make a claim about an infinite class of observations. Since few of these observations will actually be made for any object of measurement, all attributes are, in a sense, theoretical constructs.

Attributes are "constructed" by specifying classes of observations. Measurements of attributes are based on samples from these universes. In order to interpret the results of measurement as the value of an attribute for the object of measurement, we must generalize from a sample of observations to a universe of observations, and these generalizations are inductive inferences. A central concern of a theory of measurement is the justification of such inferences.



III. Measurement of Dispositional Attributes

The measurement of a dispositional attribute involves a generalization, or an inductive inference, from a particular observation to the universe of observations defining the attribute. These inductive inferences require justification, and it is the task of measurement theory to provide an analysis of the kind of justification that is required.



THE USE OF INVARIANCE PROPERTIES AS INFERENCE TICKETS

The justification for scientific inference is generally provided by appeal
to scientific laws. Hempel (1965) has discussed the use of laws as a basis for
scientific inference in considerable detail, and Toulmin (1953), who sees the
inferences derivable from a set of laws as defining the content of the laws,
has suggested the term "inference tickets" for scientific laws when they are
used in this way.

The type of law that is needed to justify the interpretation of
observations as measurements is an invariance property. An invariance property
states that the results of a certain kind of observation do not depend on some
of the particular conditions of observation. Invariance properties are
necessarily involved in measurement, which assigns a value to an
attribute of some object of measurement on the basis of some set of
observations.



A complete description of an observation would require exhaustive
specification of all of the conditions under which the observation is made.
Since the operational definition of an attribute specifies only some of the
conditions of the observations, thereby allowing the other characteristics to
vary, it describes a class of observations rather than a single observation.
The attribute is identified with this class of observations and not with any
particular observation. For any attribute, the conditions of observation are
limited by the selection rules but are not uniquely specified. Any one of
this class of observations could be used to assign a value for the attribute
to the object of measurement. The observed score, $X_{oi}$, for an observation
is the real number assigned to the observation by the structural rule for the
attribute.

A different but equally legitimate observed score, $X_{oi'}$, could be
obtained by changing the conditions of observation in accordance with the
selection rules.

When an observed score is interpreted as a measurement, it is assumed,
implicitly or explicitly, that this observed score can be taken as the value of
the attribute for the OM. Since the observed scores for different observations
may be assigned to the value of the same attribute for an OM, the following
relationship must hold at least approximately in order to maintain consistency:

$$X_{oi} = X_{oi'} \tag{3.1}$$

where $i$ and $i'$ represent any two observations for the object, $o$, that
meet the specifications of the operational definition for the attribute. That
is, the observed scores must be invariant over the universe of observations
defining the attribute. Since the two quantities in Eq(3.1) are based on
observations, this assertion is testable for any pair of observations, and
Eq(3.1) is an empirical law.
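Because Eq(3.1) is testable for any pair of observations, a check on a set of observed scores for one object can be sketched in a few lines. The function names, the tolerance parameter, and the readings below are illustrative assumptions and do not come from the paper.

```python
from itertools import combinations


def max_pairwise_discrepancy(scores):
    """Largest |X_oi - X_oi'| over all pairs of observed scores for one object."""
    return max(abs(a - b) for a, b in combinations(scores, 2))


def invariance_holds(scores, tolerance=0.0):
    """True if every pair of observed scores agrees to within `tolerance`."""
    return max_pairwise_discrepancy(scores) <= tolerance


# Hypothetical observed scores for a single object of measurement.
readings = [60.0, 58.0, 59.0]
print(max_pairwise_discrepancy(readings))
print(invariance_holds(readings, tolerance=1.0))
```

With a tolerance of zero the check is the strict form of the invariance property; a positive tolerance corresponds to treating the relationship as holding only approximately.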



Eq(3.1) is an invariance property in the sense that the observed scores are
invariant over the universe defining the attribute. For a given object of
measurement, $o$, all observations included in the definition of the attribute
should assign the same value to the object of measurement. If all observations
do assign the same value to the OM, this value is taken to be the value of the
attribute, $u_o$, for the object of measurement, $o$.

Note that the invariance properties required for the measurement of an
attribute depend on the definition of the universe for the attribute being
measured. If the two quantities in Eq(3.1) were not taken as measurements of
the same attribute, for the same OM, there would be no reason to require that
they should have the same value.

Considering again the example discussed earlier, if anxiety is interpreted
as a trait, the situations in which observations are made are conditions of
observation, and invariance over these situations is assumed. If anxiety is
interpreted as a state, the objects of measurement are persons in situations,
and changes in the value of the observed score as a function of the situation
are consistent with this interpretation.

Invariance properties are involved in measurement because they justify
inferences from samples of observations to a universe of observations. If all
of the observations in the universe give the same result for any object of
measurement, then any one of these observations provides complete information
about the universe. If Eq(3.1) holds for all pairs of observations defining
an attribute, it provides the necessary justification for inferences from
observed scores to universe scores. To the extent that observations fail to
satisfy Eq(3.1), such inferences are not justified. Therefore, the invariance
property in Eq(3.1) is necessary for the interpretation of observations as
measurements of dispositional attributes.


If the observations in the UG for each object of measurement do not all
yield the same observed scores, a variety of values are assigned to a single
quantity, the universe score. Therefore, violations of the invariance property
in Eq(3.1) imply inconsistency in inferences from observed scores to universe
scores. If the magnitude of the discrepancies is generally small, it may be
possible to ignore them, and to treat Eq(3.1) as an approximation, resulting
in what Suppes (1974) has called a deterministic theory without a theory of
error. The alternative is to introduce an explicit theory of errors.

ERRORS OF MEASUREMENT

In order to develop an index for the accuracy of this approximation, and
therefore of the dependability of inferences from observed scores to universe
scores, the concept of an error of measurement is introduced. The result of
any observation on an object, $o$, is taken to be the sum of the "true" value
of the attribute, $u_o$, plus an error of measurement, $e_{oi}$:

$$X_{oi} = u_o + e_{oi} \tag{3.2}$$

Since neither the "true" value nor the error of measurement is directly
observable, Eq(3.2) is not a testable hypothesis; rather, it is a definition
of the variable $e_{oi}$. For any observation and any value of $u_o$, the value
for $e_{oi}$ can be chosen so that the two sides of Eq(3.2) are equal, and
therefore Eq(3.2) is a tautology.

However, the values assigned to the error term in the formulation presented
above are not arbitrary. Given the UG for an attribute and any value for
$u_o$, the magnitudes of the errors are determined. If the
observations on a given object of measurement vary widely, the magnitudes
assigned to the errors must be large, and if the observations have
approximately the same value, the errors can be taken to be small.

Measurements with small errors of measurement are, of course, generally
preferred to measurements involving large errors of measurement, and Eq(3.2)
provides the basis for a relative criterion for the dependability of
measurement.



Although this development has assumed that $u_o$ is a constant for any
object of measurement, the value of this constant has not been specified. In
both the physical and behavioral sciences, $u_o$ is generally equated with the
mean over all observations allowed by the operational definition of the
attribute:

$$u_o = E_i\,(X_{oi}) \tag{3.3}$$

Eq(3.3) determines a unique value of the attribute for each OM. This choice is
a convention, which is both convenient and plausible, but it is a convention.
The most compelling reason for this convention is that it minimizes the
mean-square error.
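The minimizing property can be verified in a few steps. The derivation below is a sketch in notation assumed for illustration: $c$ is any candidate value for the attribute, $E_i$ is the expectation over the observations admitted by the operational definition, and $u_o$ is the mean over those observations, as in the text.

```latex
\begin{aligned}
E_i\,(X_{oi} - c)^2
  &= E_i\,\bigl[(X_{oi} - u_o) + (u_o - c)\bigr]^2 \\
  &= E_i\,(X_{oi} - u_o)^2 + 2\,(u_o - c)\,E_i\,(X_{oi} - u_o) + (u_o - c)^2 \\
  &= E_i\,(X_{oi} - u_o)^2 + (u_o - c)^2 .
\end{aligned}
```

The cross term vanishes because $E_i\,(X_{oi} - u_o) = 0$ when $u_o$ is the mean, so the mean-square error is smallest when $c = u_o$.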

With the value of an attribute defined by Eq(3.3), it is easy to show that
the expected value of the errors of measurement, as defined by Eq(3.2), is zero
for each object of measurement:

$$E_i\,(e_{oi}) = 0 \tag{3.4}$$

The error variance for each object of measurement is given by:

$$\sigma^2_i(e_{oi}) = E_i\,(e_{oi}^2) = E_i\,(X_{oi} - u_o)^2 \tag{3.5}$$

The error variance in Eq(3.5) is a measure of the dispersion in estimates of
the universe score for each object of measurement. Since $u_o$ is a constant
for each OM, the error variance in Eq(3.5) is equal to the observed score
variance for the OM. Where they can be estimated, the error variances in
Eq(3.5) are very useful because they provide an indication of the accuracy of
inferences from observed scores to universe scores, for each OM. In the
physical sciences, the precision of measurement is often reported for each OM
in terms of the square root of Eq(3.5). However, the direct estimation of
this error variance requires repeated observations on each OM, and this is
often not practical for behavioral measurements.
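As a small numerical illustration of the error variance for a single OM and its square root, the sketch below uses the mean of a finite set of repeated readings as a stand-in for the universe score. The readings and function names are hypothetical, and dividing by the number of readings (rather than one fewer) is a simplifying assumption.

```python
from math import sqrt
from statistics import fmean


def error_variance(observations):
    """Mean squared deviation of repeated observations on one OM from their mean."""
    center = fmean(observations)  # sample stand-in for the universe score
    return fmean((x - center) ** 2 for x in observations)


def standard_error(observations):
    """Square root of the error variance, on the same scale as the observations."""
    return sqrt(error_variance(observations))


# Hypothetical repeated readings on one object of measurement.
repeated = [60.0, 58.0, 59.0, 61.0, 57.0]
print(error_variance(repeated))   # 2.0
print(standard_error(repeated))
```

The need for several readings per object is visible here: with a single reading the deviation from the mean is always zero, so nothing can be learned about the error variance.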

A more easily estimated parameter is the expected error variance over the
population, as given by:

$$\sigma^2(e_{oi}) = E_o\,E_i\,(e_{oi}^2) \tag{3.6}$$

Eq(3.6) provides an indication of the accuracy of inferences from observed
scores to universe scores, averaged over all objects of measurement. Although
Eq(3.6) doesn't provide separate indices of the accuracy of measurement for
each OM, it does provide a useful overall index of accuracy for the
population. The average error variance can be estimated with pairs of
observations on objects of measurement.
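One way to see why pairs of observations suffice: for two independently sampled observations on the same OM, the universe score cancels in the difference, and if the two errors are uncorrelated the expected squared difference is twice the error variance. The sketch below applies that identity; the estimator, its name, and the data are assumptions made for illustration, not a procedure given in the paper.

```python
from statistics import fmean


def average_error_variance(pairs):
    """Half the mean squared difference between two observations on each OM.

    Assumes the two observations in each pair are independently sampled,
    so that E[(X_oi - X_oi')^2] equals twice the error variance.
    """
    return fmean((first - second) ** 2 for first, second in pairs) / 2.0


# Hypothetical pairs of observed scores, one pair per object of measurement.
pairs = [(60.0, 58.0), (70.0, 71.0), (65.0, 65.0)]
print(average_error_variance(pairs))
```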

Using Eq(3.4), the covariance between universe scores and errors of
measurement can be shown to be equal to zero:

$$\sigma(u_o, e_{oi}) = E_o\,(u_o - u)\,E_i\,(e_{oi}) = 0 \tag{3.7}$$

where $u$ is the mean universe score over the population, and the covariance
is taken over the population and over the UG. Since the errors of measurement
are independent of the universe scores, the observed score variance can be
partitioned as:

$$\sigma^2(X_{oi}) = \sigma^2(u_o) + \sigma^2(e_{oi})$$

where $\sigma^2(u_o)$ is the variance in the universe scores over the
population and $\sigma^2(e_{oi})$ is the error variance taken over the
population and the UG.
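The partition of observed score variance can be illustrated by simulation. The sketch below assumes normally distributed universe scores and errors with arbitrary standard deviations; these distributions, the sample size, and all names are assumptions chosen only to make the example concrete.

```python
import random
from statistics import fmean, pvariance


def simulate(n_objects=20000, sd_universe=10.0, sd_error=5.0, seed=1):
    """Return (universe scores, errors, observed scores), one observation per OM."""
    rng = random.Random(seed)
    universe = [rng.gauss(50.0, sd_universe) for _ in range(n_objects)]
    errors = [rng.gauss(0.0, sd_error) for _ in range(n_objects)]
    observed = [u + e for u, e in zip(universe, errors)]
    return universe, errors, observed


def covariance(xs, ys):
    """Population covariance of two equally long sequences."""
    mx, my = fmean(xs), fmean(ys)
    return fmean((x - mx) * (y - my) for x, y in zip(xs, ys))


universe, errors, observed = simulate()
print(pvariance(observed))                      # observed score variance
print(pvariance(universe) + pvariance(errors))  # sum of the two components
print(covariance(universe, errors))             # close to zero
```

In any finite sample the two printed variances differ by exactly twice the sample covariance, which shrinks toward zero as the number of objects grows.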

Although scientists would prefer to recognize the presence of some error
of measurement rather than add complexity to theories, they still seek to
minimize the magnitude of such errors. This is usually done by standardizing
the conditions of observations in measurement and by basing each measurement
on more than one observation. The choice of the mean over the universe of
generalization as the value of the attribute is consistent with this tendency
to try to minimize errors of measurement.

ERRORS OF MEASUREMENT AS CONSTRUCTS

In the absence of any assumptions about the object of measurement, the
concept of an error of measurement is unnecessary. If we restrict our
attention to observations, there is no reason to reject the hypothesis that
every observation is perfectly accurate. If measurement assigned numbers to
observations, therefore, it would not be necessary to introduce the concept of
an error of measurement.

Suppose, for example, that two observers put mercury thermometers into the
same glass of water at the same time. Suppose further that one of the
observers records that the mercury rises to a mark labeled 60 and the other
observer records that the mercury rises to a mark labeled 58. These two
observations differ in several ways. If the two numbers, 60 and 58, are
assigned to the observations, there is no reason to assume that either
observation should be said to contain any error. The two observations occurred
as they occurred. The concept of an error of measurement arises only when
attention is shifted from observation to measurement. The assumptions about
invariance properties that are involved in the measurement of any attribute
force us to introduce the concept of an error of measurement.

The usual analysis of the example given above takes temperature to be the
attribute, and the glass of water at a particular time to be the object of
measurement. The temperature is assumed to be a function of the water and the
time. This implies that the two observations described above should agree
with each other, as indicated in Eq(3.1). That is, temperature is assumed to
be invariant over thermometers and over observers.

However, any two observations on the same object of measurement will, in
general, produce different numerical results. Since measurement is intended
to map each object into one real number, theory must be adjusted in one of two
ways. One approach is to redefine the objects of measurement so that the
measurements which disagree with each other involve different objects of
measurement. In the example given above, the object of measurement could be
redefined to be a small volume of water in the glass at a given time, as would
be the case in investigations of thermal diffusion. Since the two thermometers
must be at different positions in the water, the differences between the two
measurements can be explained by the fact that they apply to different OMs.
This approach resolves the inconsistency between assumptions and observations,
but it does so at the cost of greatly increasing the number of OMs to be
considered.






An alternative approach leaves the definition of the object of measurement
unchanged, but introduces an explicit theory of errors. It is thereby
recognized that the observations used in measurement depend on the conditions
of observation as well as the object of measurement.



For many applications, the expected error variance in Eq(3.6) is not a
very good index for the dependability of measurement. The magnitude of the
error variance can be changed simply by changing the scale (e.g. inches to
feet or meters), and the evaluation of a measurement procedure should not
depend on such an arbitrary choice. Therefore, it is not the absolute
magnitude of the error variance that is significant, but the magnitude
relative to the degree of precision needed for some purpose.
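The dependence of the error variance on the unit of measurement is easy to demonstrate. In the sketch below the same hypothetical readings are expressed in inches and in feet; the readings and names are illustrative assumptions.

```python
from statistics import pvariance

INCHES_PER_FOOT = 12.0


def rescale(scores, old_units_per_new_unit):
    """Express scores in a new unit that contains `old_units_per_new_unit` old units."""
    return [s / old_units_per_new_unit for s in scores]


# Hypothetical repeated readings on one object, in inches.
lengths_in = [60.0, 58.0, 59.0, 61.0, 57.0]
lengths_ft = rescale(lengths_in, INCHES_PER_FOOT)

var_in = pvariance(lengths_in)
var_ft = pvariance(lengths_ft)
print(var_in, var_ft, var_in / var_ft)
```

The variance shrinks by the square of the conversion factor (here 144) although nothing about the quality of the readings has changed, which is why an index relative to some tolerance or to score variability is needed.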


The degree of precision, or tolerance, required of measurements varies
from one area of science to another. The astronomer who is measuring the
distances between stars can tolerate errors of thousands of kilometers, while
a crystallographer might consider an error of a thousandth of a centimeter to
be unacceptable. Between these two extremes lies a continuum of possible
tolerances, including those of the engineer who wants the separate parts of a
bridge to fit together. If the error variance in Eq(3.5) is based on the same
scale as the tolerance, the dependability of measurement can be evaluated
directly by comparing the estimated magnitude of the error to the tolerance.
This is the usual method for evaluating the precision of measurement in
physics, and since the tolerances in a particular area of investigation are
usually well known, it is common practice to report the square root of Eq(3.5)
or Eq(3.6) as an index of the precision of measurement.

The practice of reporting the relative magnitude of errors of measurement
is sufficiently general in the physical sciences as to be introduced in the
first chapter of an introductory textbook (PSE, 1958, p. 14):

    "If a surveyor measures a distance with great care he might get
    100.132 meters ± 0.3 cm. His work is a great deal more accurate
    than that done when the width of a book page is measured to the
    nearest millimeter with a ruler, even though his error is
    something like three times as big as what anyone would perhaps
    make on the page in ten seconds' work. This sometimes finds
    expression in another way when the estimated spread of
    measurements, the tolerance, is stated, using decimal fractions,
    or percentage. Thus the surveyor would say his length was
    100.132 meters ± 0.003%, while the page is just 20.1 cm ± 0.5%."
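The percentages in the quoted passage follow from dividing each absolute error by the measured length. The short check below assumes an error of 0.1 cm for the page, since it is measured to the nearest millimeter; the function name is illustrative.

```python
def relative_error_percent(error, measurement):
    """Error as a percentage of the measured value; both in the same unit."""
    return 100.0 * error / measurement


# Surveyor: 0.3 cm error on 100.132 m, with the length converted to centimeters.
surveyor = relative_error_percent(0.3, 100.132 * 100.0)
# Page: 0.1 cm (one millimeter) error on 20.1 cm.
page = relative_error_percent(0.1, 20.1)

print(round(surveyor, 3), round(page, 1))   # 0.003 0.5
```

The surveyor's absolute error is three times the page error, yet his relative error is smaller by a factor of well over a hundred.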

The emphasis on stating the magnitude of the errors of measurement in relative
terms has been even more pronounced in the social sciences (Lord and Novick,
1968, p. 252):

    "...the effectiveness of a test as a measuring instrument usually
    does not depend merely on the standard error of measurement, but rather
    on the ratio of the standard error of measurement to the standard
    deviation of observed scores in the group. The more discriminating the
    test items, the larger will be the standard deviation of observed scores,
    other things being equal; and hence, the less will be the danger that
    true differences will be swamped by random errors of measurement and lost
    to view."

A suitable index for the relative magnitude of errors of measurement is
suggested by the relationship between dispositional attributes and the rank
ordering of the properties of observations. As a minimal requirement, the
errors should not be so large as to cause significant fluctuations in the
ranks assigned to OMs from one set of observations to another. If the universe
scores for two objects of measurement are $u_o$ and $u_{o'}$, errors of
measurement which are less than the absolute value of $(u_o - u_{o'})/2$ will
not distort the ordering of the observed scores for these two objects. In
comparing these two objects of measurement, therefore, an error variance which
is smaller than $(u_o - u_{o'})^2$ can be considered a relatively small error
variance (see Cronbach and Gleser, 1964, for more detailed analysis of
signal-to-noise ratios).
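The bound on errors that preserves the ordering of two objects can be checked directly. In the sketch below the universe scores and errors are hypothetical values, and the function names are assumptions made for illustration.

```python
def ordering_preserved(u_low, u_high, e_low, e_high):
    """True if the observed scores keep the order of the universe scores."""
    return (u_low + e_low) < (u_high + e_high)


def half_distance(u_low, u_high):
    """Half the distance between the two universe scores."""
    return abs(u_high - u_low) / 2.0


# Hypothetical universe scores for two objects of measurement.
u_a, u_b = 50.0, 54.0
limit = half_distance(u_a, u_b)

print(limit)
print(ordering_preserved(u_a, u_b, 1.9, -1.9))   # errors inside the bound
print(ordering_preserved(u_a, u_b, 2.5, -2.5))   # errors outside the bound
```

When both errors are smaller in absolute value than the half distance, the lower observed score stays below the midpoint of the two universe scores and the higher one stays above it, so the order cannot reverse.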

A more direct way of evaluating the consistency of the ranking of objects
of measurement from one set of observations to another is to estimate the
correlation between observed scores based on independently sampled
observations. Correlations indicate the degree of linear relationship between
two variables, but, in the absence of very serious departures from linearity,
a correlation coefficient depends mostly on the consistency of the rankings
from one variable to the other. Therefore, correlations are useful statistics
for evaluating the consistency of the rankings of observed scores.
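A simulation can illustrate how the correlation between two independently sampled observations reflects the relative size of the errors. The sketch assumes normally distributed universe scores and errors with standard deviations of 10 and 5, for which the ratio of universe score variance to observed score variance is 0.8; these values and all names are illustrative assumptions.

```python
import random
from math import sqrt
from statistics import fmean


def pearson(xs, ys):
    """Pearson product-moment correlation of two equally long sequences."""
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def two_observations(n_objects=20000, sd_universe=10.0, sd_error=5.0, seed=2):
    """Two observed scores per OM that share a universe score but not an error."""
    rng = random.Random(seed)
    first, second = [], []
    for _ in range(n_objects):
        u = rng.gauss(50.0, sd_universe)
        first.append(u + rng.gauss(0.0, sd_error))
        second.append(u + rng.gauss(0.0, sd_error))
    return first, second


first, second = two_observations()
print(pearson(first, second))            # close to the ratio below
print(10.0 ** 2 / (10.0 ** 2 + 5.0 ** 2))
```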

If the only purpose of measurement were to reflect the rank order of
objects on some attribute, rank order statistics would be more appropriate
than correlation coefficients in evaluating measurements. However,
measurement provides an explication of ordinal relationships rather than just
representing this ordinal relationship. Measurements are intended to assign a
number on an interval scale to each OM to represent the value of an
attribute. Correlation coefficients are appropriate indices for the precision
of such numerical assignments, while rank-order statistics would not be
appropriate for this purpose. Therefore correlation coefficients and indices
that are closely related to correlation coefficients (i.e. generalizability
coefficients) have been widely used in evaluating the dependability of
measurements. In particular, correlation coefficients constitute the basic
mathematical machinery in classical test theory.






THE ROLE OF THEORY

This analysis of measurement errors depends on the fact that certain
assumptions are made about measurable attributes. In particular it is assumed
that attributes are to be applied to specific kinds of objects of measurement,
and that certain invariance properties will hold. These assumptions are
theoretical in the sense that they form a connected body, or network, of
general laws. The criterion in Eq(3.3) is a theoretical ideal which is never
achieved in practice. Errors of measurement may be viewed as concessions to
the brute fact that the world of observations is not as neat and orderly as we
might like it to be.

Postulating the existence of errors of measurement makes it possible to
minimize the number of objects of measurement that need to be considered, and
therefore to simplify both descriptions of phenomena and the theories designed
to explain phenomena. The resulting gain in conceptual clarity is usually
worth the loss of precision involved in relegating the effects of some
conditions of observation to error.

The introduction of an explicit theory of errors represents a decision
not to study some kinds of phenomena. In the example discussed above, the
decision to attribute the difference between the two thermometer readings, 58
and 60, to errors of measurement is essentially a decision not to investigate
temperature variations within the liquid; this decision, which is not dictated
by empirical findings, reflects a choice among several possible research
strategies.

The designation of certain sources of variation as errors of measurement
is a conceptual choice rather than an empirical finding. Errors of
measurement provide a way of handling observational variations that are not
to be given an explicit description or explanation at a particular stage in
the development of a science. In order to make its task more manageable, every
science tends to restrict the phenomena that it treats explicitly. As the
science develops, it may be able to analyze phenomena that had earlier been
relegated to error variance; this decreases the error variance and enlarges
the sphere of phenomena treated by the science, but there is always some
variation which is intentionally left unexplained.

The specification of the attributes in an area of science and of objects
of measurement determines how observations are described and organized, and
this influences the kinds of questions addressed by the science, i.e. the
paradigm for the science. A change in the definitions of attributes and
objects of measurement, which is equivalent to a change in the definition of
error, represents a shift in the way that phenomena are perceived and
described. If the attributes that are redefined are fundamental, the
resulting changes may be significant enough to be called a scientific
revolution (Kuhn, 1970).

For example, the changes introduced into physics by the special theory of
relativity are basically changes in the concepts of length and time;
specifically, they are changes in the set of invariance properties associated
with length and time. In classical mechanics, length and time are assumed to
be invariant with respect to the observer; in the theory of relativity, this
invariance property is rejected, and the object of measurement is redefined to
include the observer (more precisely, the observer's frame of reference).
Special relativity had a revolutionary impact on physics because it modified
the fundamental concepts of length and time; analogous changes in less
important concepts would have had a much smaller impact. (See Frank, 1953, for
a very lucid discussion of this point.)

Although the formulation presented here assumes the existence of some set
of basic assumptions, these assumptions do not necessarily include a model of
underlying constructs or processes. Attributes are treated as dispositions
(see Carnap, 1953), and there is no ontological commitment to attributes as
things. Attributes are defined in terms of universes of observations, and the
assumptions specify the syntax and semantics of the descriptive language of
some part of science.

The existence of errors of measurement, therefore, depends on theoretical
assumptions about attributes, and the operational definition of errors of
measurement depends on the definitions of the attributes and objects of
measurement. In particular, the definition of the objects of measurement
determines whether the observed differences among observations are to be
interpreted as errors of measurement, or as differences in the attribute for
different objects of measurement.






IV Generalizability Theory as the Basis for a
Sampling Model for Validity

This discussion of generalizability theory is necessarily only a brief
outline. A thorough presentation of generalizability theory can be found in
The Dependability of Behavioral Measurements (Cronbach et al., 1972).
Introductions to some of the basic ideas in generalizability theory are found
in Lindquist (1953) and in Brennan and Kane (1979).

GENERALIZABILITY THEORY

The purpose of both reliability theory and generalizability theory is to
characterize the dependability of measurements. Unlike reliability theory,
which treats errors of measurement as arising from a single source and uses
correlation coefficients as indices of reliability, generalizability theory
recognizes the existence of multiple sources of error, and uses a variety of
multifaceted designs to estimate the variance components for different sources
of error.

In generalizability theory, any observation on an object of measurement
is assumed to be sampled from a universe of observations. The observations in
this universe are characterized by the conditions under which they are made,
and the set of all conditions of a particular type is called a facet. For
example, if persons are the objects of measurement and the dispositional
attribute is a state variable, the universe could include an item facet, an
occasion facet, and perhaps a rater facet.

Cronbach et al. (1972, p. 20) draw a distinction between G studies, or
generalizability studies, which examine the dependability of measurement
procedures, and D studies, or decision studies, which provide the data for
substantive decisions. In this paper, the term "measurement procedure" will
often be used in place of the term "D study". A measurement procedure
incorporates a sampling design for obtaining observations on objects of
measurement that is used over a number of separate studies. The term
"D study" suggests that the sampling design for measurements of an attribute
is likely to change from one study to another. Although the possibility of
such changes in D-study design is explicitly considered at several places in
this paper, much of the discussion will emphasize the effects of
standardization of the conditions of observations. Measurement procedure is a
more descriptive term than D study when some facets are standardized over all
observations.

The distinction drawn between a universe of admissible observations and a
universe of generalization (Cronbach et al., 1972, p. 20) is based on the
distinction between G studies and D studies. In conducting a G study, certain
facets are investigated, and a certain range of conditions is considered with
respect to each facet. The facets investigated in the G study define a
universe of admissible observations. In interpreting the observations in a D
study as measurements of an attribute, inferences are drawn to a universe of
observations that provides an operational definition of the attribute. In
generalizability theory, this universe is called the universe of generalization
for the attribute.

The universe of admissible observations is associated with estimation, and
indicates the facets for which variance components have been estimated in G
studies. Since estimation issues are generally not addressed in this paper,
few references will be made to the universe of admissible observations. The
concept of a universe of generalization, which defines an attribute, will be
used extensively in this paper. In addition, another universe, not discussed by
Cronbach et al. (1972), will later be introduced in order to describe a
measurement procedure.




The purpose of the G study is to estimate components of variance, which may
then be used to evaluate the dependability of inferences to the universe of
generalization. If the components of variance estimated in a G study are to
provide the information needed for evaluating the D study, they must provide
estimated variances for sources of error in the D study. Therefore, the
universe of admissible observations must include the universe of
generalization. G studies are most useful when they employ crossed designs
and large sample sizes to provide stable estimates of as many variance
components as possible. For any measurement procedure, there are many facets
that might be considered, but variance components for only a few of these can
be independently estimated in any G study. Therefore, several G studies may
be required to adequately evaluate the dependability of a measurement
procedure.

The universe score, the expected value over the universe of generalization,
is stipulated to be the value of the attribute for any object of measurement.
Universe scores are not directly observable, but can be estimated by the mean
over a sample of observations; that is, for each object of measurement, the
observed score is used as an estimate of the universe score. In
generalizability theory, then, questions about the reliability of a measurement
procedure are replaced by questions about the generalizability of observed
scores, and the dependability of such generalizations is described by a
generalizability coefficient.



Cronbach et al. (1972, p. 97) define the coefficient of generalizability for
an attribute and a D study as the ratio of the universe score variance to the
expected observed score variance for the D study design. The universe score
variance in a generalizability coefficient replaces the true score variance of
classical test theory, and the expected observed score variance replaces the
observed score variance of classical test theory.

Cronbach et al. (1972, p. 98) discuss two interpretations of the
generalizability coefficient for a D study which samples from the intended
universe of generalization. First, the generalizability coefficient is
approximately equal to the correlation between observed scores for two
independent random samples of observations from the universe of generalization.
Second, the generalizability coefficient is approximately equal to the expected
value of the squared correlation between the observed score and the universe
score.
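The second interpretation can be illustrated with a small simulation. The sketch below assumes a D study in which each observed score is the mean of a fixed number of randomly sampled observations, so that the expected observed score variance is the universe score variance plus the error variance divided by that number. The normal distributions, parameter values, and function names are assumptions made for illustration.

```python
import random
from math import sqrt
from statistics import fmean


def generalizability_coefficient(var_universe, var_error, n_observations=1):
    """Universe score variance over expected observed score variance."""
    return var_universe / (var_universe + var_error / n_observations)


def pearson(xs, ys):
    """Pearson product-moment correlation of two equally long sequences."""
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def squared_correlation_with_universe(n_objects=20000, n_observations=2,
                                      sd_universe=10.0, sd_error=5.0, seed=3):
    """Squared correlation between mean observed scores and universe scores."""
    rng = random.Random(seed)
    universe, observed = [], []
    for _ in range(n_objects):
        u = rng.gauss(50.0, sd_universe)
        universe.append(u)
        observed.append(fmean(u + rng.gauss(0.0, sd_error)
                              for _ in range(n_observations)))
    return pearson(observed, universe) ** 2


expected = generalizability_coefficient(10.0 ** 2, 5.0 ** 2, n_observations=2)
simulated = squared_correlation_with_universe()
print(expected, simulated)
```

Raising the number of observations per object increases the coefficient, which matches the practice of basing each measurement on more than one observation.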



A LINEAR MODEL

Generalizability theory allows for the use of a variety of linear models in
interpreting the results of both G studies and D studies, depending on the
design of the study.

The universe of generalization typically involves a large number of facets,
and in principle the model for observed scores could explicitly include any
number of these facets. For the sake of simplicity, however, a simple
one-facet model with replications will be used as a basis for discussion
throughout this paper. In this simple model, one facet is considered
explicitly; all other facets in the universe of generalization are assumed to
be sampled randomly and independently, and are subsumed under a single
replication facet. The observed scores for all observations in the universe
of generalization are represented by the linear model:



$$X_{oir} = \mu + \alpha_o + \alpha_i + \alpha_{oi} + \varepsilon_{oir} \tag{4.1}$$

where

    $\mu$ is the grand mean
    $\alpha_o$ is the main effect for the object of measurement, o
    $\alpha_i$ is the main effect for the i facet
    $\alpha_{oi}$ is the oi interaction
    $\varepsilon_{oir}$ is the replication effect

The linear model in Eq(4.1) represents the observed scores in the universe
of generalization; it is not intended to represent the sampling design for any
particular G study or D study. The universe of generalization defines an
attribute for a population.


It is assumed that the i facet is crossed with objects of measurement in
the universe of generalization; that is, in the universe of generalization,
there is an observed score for each possible combination of an object of
measurement and a condition from the i facet. This does not necessarily imply
that D studies or G studies associated with measurements of the attribute will
employ crossed designs. Each effect in the model is assumed to be uncorrelated
with every other effect. In addition, in order to make the estimates of the
effects unique, the expected value of each effect over any of its subscripts is
set equal to zero.

Eq(4.1) is essentially a generalization of Eq(3.2). The main difference
between the classical test theory model in Eq(3.2) and the linear model in
Eq(4.1) is that the classical test theory model assumes the existence of only
two sources of variance in the observed score, while the model in Eq(4.1)
explicitly considers four sources of variance. The general linear model can
be formulated to include as many sources of variance as necessary, and can be
made to reflect the design under which the observations are made.

The model in Eq(4.1) includes two facets, labeled i and r, and for each of
these facets, there is a universe of conditions from which the conditions in a
particular study may be drawn. These universes may be either finite or
infinite. Although the consideration of finite universes does not pose a
fundamental problem for generalizability theory, it would complicate the
discussion, and for the sake of simplicity, it is assumed in this paper that
the universe of conditions for each facet is infinite.

From a G study in which conditions of the i facet are crossed with objects
of measurement, o, and replications are nested within these oi combinations,
four components of variance can be independently estimated. The variances for
the four random effects in Eq(4.1) are designated as $\sigma^2(o)$,
$\sigma^2(i)$, $\sigma^2(oi)$, and $\sigma^2(r)$.

In a D study, the observed scores for objects of measurement are usually
based on the sum or average taken over a sample of observations, and capital
letters will be used to designate the average value of an effect over a sample
of observations. The variance component for the average value of the main
effect, $\alpha_I$, over a sample of $n_i$ conditions of the i facet is given by

$$\sigma^2(I) = \sigma^2(i)/n_i \tag{4.2a}$$

Similarly, the variance component for the average value of the oi interaction,
over samples of $n_i$ conditions of the i facet, is

$$\sigma^2(oI) = \sigma^2(oi)/n_i \tag{4.2b}$$

and the variance component for the average value of the replication effect over
samples of $n_r$ replications on each of $n_i$ conditions of the i facet, is

$$\sigma^2(R) = \sigma^2(r)/(n_i n_r) \tag{4.2c}$$



The fact that the variance components in Eq(4.2) are divided by sample
sizes is a reflection of the general statistical principle that sampling
variances for means over random samples are equal to the sampling variances for
single observations, divided by the number of observations in the sample. The
relationships listed in Eq(4.2) can be used to estimate variance components for
D studies involving any number of conditions of the i facet, and any number of
replications, once the required random effects variance components are
estimated in G studies.
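
The scaling in Eq(4.2) can be sketched in a few lines of Python. This is only an illustration: the function name `d_study_components` and the numerical variance components are hypothetical and are not taken from the paper.

```python
def d_study_components(var_i, var_oi, var_r, n_i, n_r):
    """Scale G-study variance components to D-study means, following Eq(4.2).

    var_i, var_oi, var_r are the single-observation components for the i main
    effect, the oi interaction, and replications; n_i and n_r are the D-study
    sample sizes. All names here are illustrative assumptions.
    """
    if n_i < 1 or n_r < 1:
        raise ValueError("sample sizes must be at least 1")
    return {
        "I": var_i / n_i,          # mean over n_i conditions
        "oI": var_oi / n_i,        # mean over n_i conditions
        "R": var_r / (n_i * n_r),  # mean over n_i * n_r replications
    }


# Hypothetical G-study estimates, applied to a D study with 4 conditions
# of the i facet and 2 replications per condition.
example = d_study_components(var_i=0.30, var_oi=0.60, var_r=1.20, n_i=4, n_r=2)
print(example)
```

Because the G-study components enter only as numerators, the same estimates can be reused for any proposed D-study sample sizes.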

General procedures for the estimation of variance components from computed
mean squares are discussed by Cornfield and Tukey (1956), Cronbach et al.
(1972), Millman and Glass (1967), Brennan (1977), and by most standard textbooks
on experimental design (e.g., Kirk, 1968; Winer, 1971).
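
As a rough sketch of how such estimates are obtained, the following assumes a balanced random-effects G study in which n_o objects are crossed with n_i conditions and n_r replications are nested within each oi cell, and it uses the usual expected mean squares for that design. The function name and the example mean squares are hypothetical; the sources cited above give the general rules.

```python
def variance_components_from_mean_squares(ms_o, ms_i, ms_oi, ms_r, n_o, n_i, n_r):
    """Solve the expected-mean-square equations for a balanced o x i design
    with n_r replications nested within each oi cell (all effects random).

    Expected mean squares assumed here:
        E(MS_r)  = var_r
        E(MS_oi) = var_r + n_r * var_oi
        E(MS_o)  = var_r + n_r * var_oi + n_i * n_r * var_o
        E(MS_i)  = var_r + n_r * var_oi + n_o * n_r * var_i
    """
    var_r = ms_r
    var_oi = (ms_oi - ms_r) / n_r
    var_o = (ms_o - ms_oi) / (n_i * n_r)
    var_i = (ms_i - ms_oi) / (n_o * n_r)
    # Estimates are returned as computed; with real data they can be negative.
    return {"o": var_o, "i": var_i, "oi": var_oi, "r": var_r}


# Hypothetical mean squares for 10 objects, 4 conditions, 2 replications.
estimates = variance_components_from_mean_squares(
    ms_o=21.0, ms_i=15.0, ms_oi=5.0, ms_r=3.0, n_o=10, n_i=4, n_r=2
)
print(estimates)
```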

MEASUREMENT PROCEDURES BASED ON RANDOM SAMPLING FROM THE UNIVERSE OF GENERALIZATION

Although measurement procedures that use random sampling from the universe
of generalization are seldom used in practice, it is convenient to start with
this oversimplified assumption. Letting capital letters designate effects for
samples of conditions, the observed scores for a D study with i nested within
o (a separate sample of conditions of the i facet is drawn for each object of
measurement) can be represented as:

$$X_{oIR} = \mu + \alpha_o + \alpha_I + \alpha_{oI} + \alpha_R \tag{4.3}$$

where o represents the object of measurement, I indicates a sample of $n_i$
conditions from the i facet, and R indicates a sample of $n_r$ replications for
each condition of the i facet. Again, the replication index represents the
effect of all facets other than the i facet, and conditions from these facets
are assumed to be sampled randomly and independently for each observation.
Since the effects in Eq(4.3) are assumed to be independent of each other, the
expected observed score variance over the population and over the universe of
generalization is:



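$$\begin{aligned}
\sigma^2(X) &= \sigma^2(o) + \sigma^2(oI) + \sigma^2(I) + \sigma^2(R) \\
            &= \sigma^2(o) + \sigma^2(oi)/n_i + \sigma^2(i)/n_i + \sigma^2(r)/(n_i n_r)
\end{aligned} \tag{4.4}$$

The universe score, $\mu_o$, for the object of measurement, o, is given by

$$\mu_o = E_I E_R (X_{oIR}) = \mu + \alpha_o \tag{4.5}$$

and the universe score variance is given by the variance of the main effect for
objects of measurement, $\alpha_o$:

$$\sigma^2(\mu_o) = \sigma^2(o) \tag{4.6}$$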






Where observations are randomly sampled from the universe of generalization
for each object of measurement, the expected value of the observed score over
repeated applications of the measurement procedure is equal to the universe
score, and the observed score is an unbiased estimate of the universe score.



In analyzing errors of measurement, Cronbach et al. (1972, p. 76)
distinguish between the error in point estimates of universe scores (which
they designate by a capital delta and which is represented here by the symbol D)
and the error in estimates of the universe score expressed in deviation form
(which they designate by a small delta and which is represented here by the
symbol d). Cronbach et al. (1972) also discuss a third kind of error, which is
based on regression estimates, but is not used in this paper.

The error of measurement for a point estimate of $\mu_o$, based on $X_{oIR}$, is

$$D_{oIR} = X_{oIR} - \mu_o = \alpha_I + \alpha_{oI} + \alpha_R \tag{4.7}$$

Since I and R are randomly sampled for each observation on the object of
measurement, the expected value of $D_{oIR}$ over the universe of generalization
is zero, and the observed score is an unbiased estimate of the universe score
for o. The expected value of the squared error, taken over all instances of
the procedure, is equal to the average error variance within the objects of
measurement, and is given by:

$$E_I E_R (D_{oIR}^2) = \sigma^2(I) + \sigma^2(oI) + \sigma^2(R) \tag{4.8}$$

If conditions of the i facet are sampled independently for each
observation, the expected value of $X_{oIR}$ over an infinite population of
objects of measurement is also an expected value over the universe of
generalization and is equal to the grand mean, $\mu$. Therefore, if both I and R
are nested within o, the error in estimating universe scores relative to the
population mean is:

$$d_{oIR} = (X_{oIR} - \mu) - (\mu_o - \mu) = \alpha_I + \alpha_{oI} + \alpha_R \tag{4.9}$$

Since I and R are independently sampled for each observation of o, the
expected value of $d_{oIR}$ over the universe of generalization is also zero.
The expected value of the squared error, $d_{oIR}^2$, is equal to the expected
value of the squared error, $D_{oIR}^2$, as given in Eq(4.8):

$$E_I E_R (d_{oIR}^2) = \sigma^2(I) + \sigma^2(oI) + \sigma^2(R) \tag{4.10}$$

The covariance, taken over the population, of the errors, $D_{oIR}$, on two
administrations of the measurement procedure is given by:

$$E_o (D_{oIR} D_{oI'R'}) = E_o [(\alpha_I + \alpha_{oI} + \alpha_R)(\alpha_{I'} + \alpha_{oI'} + \alpha_{R'})] \tag{4.11}$$

Since the i and r facets are nested within o for the measurement procedure,
taking an expected value over o automatically involves taking expected values
over I, I', R, and R'. Therefore, the expected value over each of the
crossproducts in Eq(4.11) is zero and the errors, $D_{oIR}$, are uncorrelated.
Similarly, the covariance of the errors, $d_{oIR}$, for two administrations of
the measurement procedure is also equal to zero.
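
The three properties derived above (zero expected error, expected squared error equal to the sum of the D-study error components, and zero covariance across administrations) can be checked with a small simulation. The sketch below is illustrative only: it assumes normally distributed effects for convenience, which the argument in the text does not require, and the function name, grand mean, and variance components are all hypothetical.

```python
import random


def simulate_errors(n_objects, n_i, n_r, sd_o, sd_i, sd_oi, sd_r, seed=0):
    """Simulate errors D for two independent administrations per object.

    Each administration draws its own n_i conditions of the i facet and n_r
    replications per condition, so i and r are nested within objects.
    """
    rng = random.Random(seed)
    mu = 50.0  # hypothetical grand mean
    first, second = [], []
    for _ in range(n_objects):
        universe_score = mu + rng.gauss(0.0, sd_o)
        errors = []
        for _administration in range(2):
            total = 0.0
            for _condition in range(n_i):
                # Fresh condition effect and interaction for this object.
                a_i = rng.gauss(0.0, sd_i)
                a_oi = rng.gauss(0.0, sd_oi)
                for _replication in range(n_r):
                    total += universe_score + a_i + a_oi + rng.gauss(0.0, sd_r)
            observed = total / (n_i * n_r)
            errors.append(observed - universe_score)
        first.append(errors[0])
        second.append(errors[1])
    return first, second


def mean(values):
    return sum(values) / len(values)


first, second = simulate_errors(
    n_objects=10000, n_i=4, n_r=2,
    sd_o=1.0, sd_i=0.30 ** 0.5, sd_oi=0.60 ** 0.5, sd_r=1.20 ** 0.5,
)
mean_error = mean(first)
mean_squared_error = mean([d * d for d in first])
covariance = mean([a * b for a, b in zip(first, second)]) - mean(first) * mean(second)
# Sum of the D-study error components for these hypothetical values.
expected = 0.30 / 4 + 0.60 / 4 + 1.20 / (4 * 2)

print(round(mean_error, 3), round(mean_squared_error, 3), round(covariance, 3), expected)
```

With sample sizes of this order, the simulated mean error and covariance should be close to zero and the mean squared error close to the expected value, up to sampling fluctuation.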






In classical test theory, errors of measurement are assumed to have an
expected value of zero, to be uncorrelated across pairs of observations, and
to be uncorrelated with the universe score. Errors of measurement that satisfy
these requirements will be referred to as random errors. It is clear from the
discussion just presented that a measurement procedure based on independent
random samples from the universe of generalization satisfies these three
requirements. Therefore, as long as an instance of the measurement procedure
is defined as a randomly sampled observation from the universe of
generalization, all of the effects contributing to the error of measurement for
a dispositional attribute are random errors.

Cronbach et al. (1972) define a generalizability coefficient as the ratio
of universe score variance, which is given in Eq(4.6), to the expected observed
score variance, which is given in Eq(4.4):

$$\frac{\sigma^2(\mu_o)}{\sigma^2(X)} = \frac{\sigma^2(o)}{\sigma^2(o) + \sigma^2(oi)/n_i + \sigma^2(i)/n_i + \sigma^2(r)/(n_i n_r)} \tag{4.12}$$

where $n_i$ is the number of conditions of the i facet sampled for each
measurement, and $n_r$ is the number of replications for each of these conditions.
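
A minimal sketch of the coefficient as a function of the four variance components and the two sample sizes follows. The function name and the numerical components are hypothetical and were chosen so that the arithmetic is easy to follow.

```python
def generalizability_coefficient(var_o, var_i, var_oi, var_r, n_i, n_r):
    """Ratio of universe score variance to expected observed score variance."""
    if n_i < 1 or n_r < 1:
        raise ValueError("sample sizes must be at least 1")
    error_variance = var_oi / n_i + var_i / n_i + var_r / (n_i * n_r)
    return var_o / (var_o + error_variance)


# Hypothetical components: error variance = 0.25 + 0.25 + 0.5 = 1.0,
# so the coefficient is 1.0 / (1.0 + 1.0).
coefficient = generalizability_coefficient(
    var_o=1.0, var_i=0.5, var_oi=0.5, var_r=2.0, n_i=2, n_r=2
)
print(coefficient)  # 0.5
```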


The coefficient in Eq(4.12) incorporates tests of two separate invariance
properties, one for the i facet and the other for the replication facet. Since
the replication facet represents the effects of all but one of the facets in
the universe of generalization, the second of the invariance properties is very
general. If the observations on each object of measurement are invariant over
the i facet, the variance components for the i main effect and the oi
interaction must be small. Similarly, if observations are to be invariant over
all other facets in the universe of generalization, the replication variance
component must be small. In general, all of the variance components that
appear as part of the error variance in generalizability coefficients are
associated with invariance properties. To the extent that these variance
components are close to zero, the invariance properties are good
approximations.

Note that no assumptions need to be made about how the errors are
distributed. In particular, it isn't necessary to assume a normal distribution
for the errors in order to estimate generalizability coefficients. Assumptions
about the distribution of errors are used in constructing confidence intervals,
but confidence intervals will not be discussed in this paper.

THE INFLUENCE OF SAMPLE SIZE ON GENERALIZABILITY

In classical test theory, the error variance is undifferentiated, and
increasing the number of observations averaged to obtain an observed score
leaves the true score variance unchanged and decreases the error variance.
Where an observed score is defined as the mean over a sample of observations
for the object of measurement, increasing the size of this sample decreases
the sampling variance of the mean. This regularity is the basis for the
Spearman-Brown formula for changes in the length of a test.
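
The connection between averaging and the Spearman-Brown formula can be illustrated as follows. The sketch uses the standard form of the formula, k*rho / (1 + (k - 1)*rho), and compares it with the reliability computed directly from a true score variance and an undifferentiated error variance; the function names and the variances are hypothetical.

```python
def spearman_brown(reliability, k):
    """Predicted reliability when the number of observations is multiplied by k."""
    return k * reliability / (1.0 + (k - 1.0) * reliability)


def reliability_from_variances(true_variance, error_variance, n_observations):
    """Reliability of a mean of n observations with undifferentiated error."""
    return true_variance / (true_variance + error_variance / n_observations)


# Hypothetical variances: a single observation has reliability 0.5.
single = reliability_from_variances(1.0, 1.0, 1)
doubled_direct = reliability_from_variances(1.0, 1.0, 2)
doubled_formula = spearman_brown(single, 2)
print(single, doubled_direct, doubled_formula)
```

The two routes agree because dividing the error variance by the number of observations is exactly the regularity the formula encodes.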

For the coefficient in Eq(4.12), the relationship between the error
variance and the number of conditions sampled for a facet is not so simple.
However, by using Eq(4.12), it is possible to predict the generalizability
coefficient for any number of conditions of the i facet and any number of
replications. The fact that the generalizability coefficient can be predicted
for various combinations of sample sizes for the different facets makes it
possible to maximize the dependability of measurement for a fixed number of
observations. In general, this is accomplished by sampling most thoroughly
those facets that make the largest contribution to the error variance.
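
One way to picture this kind of planning is a brute-force search over sample sizes. The sketch below assumes that the only constraint is a fixed total number of observations (n_i times n_r), which is a simplification; the function names and variance components are hypothetical.

```python
def coefficient(var_o, var_i, var_oi, var_r, n_i, n_r):
    """Generalizability coefficient for the one-facet model with replications."""
    error_variance = var_oi / n_i + var_i / n_i + var_r / (n_i * n_r)
    return var_o / (var_o + error_variance)


def best_allocation(var_o, var_i, var_oi, var_r, max_observations):
    """Search all (n_i, n_r) with n_i * n_r <= max_observations.

    Returns a tuple (coefficient, n_i, n_r) for the best combination found.
    """
    best = None
    for n_i in range(1, max_observations + 1):
        for n_r in range(1, max_observations // n_i + 1):
            value = coefficient(var_o, var_i, var_oi, var_r, n_i, n_r)
            if best is None or value > best[0]:
                best = (value, n_i, n_r)
    return best


# Hypothetical components and a budget of 12 observations per object.
value, n_i, n_r = best_allocation(1.0, 0.30, 0.60, 1.20, max_observations=12)
print(round(value, 3), n_i, n_r)
```

In this simple model every error term is divided by n_i, so under a pure observation budget the search favors many conditions with one replication each. Unequal costs for conditions and replications, which are not modeled here, would change that result.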



The fact that it facilitates the design of efficient measurement procedures
is one important advantage of generalizability theory. However, an equally
important point for the purposes of this paper is the fact that the total
error variance for a measurement procedure can be made arbitrarily small by
increasing the sample sizes. Therefore, if the variance component, $\sigma^2(o)$,
is greater than zero, the generalizability coefficient in Eq(4.12) approaches
a limit of 1.0 as the sample sizes for all facets approach infinity. (For
facets with a finite number, $N_i$, of conditions, the variance components for
the facet go to zero as the sample size, $n_i$, approaches $N_i$.) Therefore,
increasing the sample sizes for various facets provides a simple way of
improving the dependability of any measurement procedure.
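
For the one-facet model with an infinite universe of conditions, the limit can be written out directly from the definition of the coefficient. This restates the argument above and adds no new result:

$$\lim_{n_i \to \infty} \frac{\sigma^2(o)}{\sigma^2(o) + [\sigma^2(oi) + \sigma^2(i)]/n_i + \sigma^2(r)/(n_i n_r)} = \frac{\sigma^2(o)}{\sigma^2(o)} = 1, \qquad \sigma^2(o) > 0.$$

Every error term in this model is divided by $n_i$, so the limit is reached for any fixed $n_r \geq 1$.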

However, there are practical limits on how far this method can be pursued,
and for important attributes, it is often impractical to achieve satisfactory
dependability of measurement by increasing sample sizes. Later in this paper,
more sophisticated approaches to the problem of measurement errors will be
discussed. Although these techniques make it possible to control errors of
measurement without inordinately large sample sizes, they also make it
necessary to replace the simple universe sampling model discussed in this
section by more complicated models.

A UNIVERSE SAMPLING MODEL FOR VALIDITY

Numerical estimates of generalizability coefficients are developed in two
steps. First, components of variance are estimated in G studies. Second,
generalizability coefficients are computed from the estimated variance
components and the sample sizes for the D study.

Since the universe score for each object of measurement has been
stipulated to be the value of the attribute for the object of measurement, a
measurement procedure is valid to the extent that it accurately estimates the
universe score. For a measurement procedure consisting of random sampling
from the universe of generalization, the observed score is an unbiased
estimate of the universe score, and the random errors assumed in Eq(4.7) are
the only sources of error in the measurement procedure. Since the
generalizability coefficient provides an index of how accurately universe
scores can be inferred from observed scores, it can be interpreted as a
validity coefficient.

By definition, therefore, the value of the attribute for an object of
measurement is the universe score, or the mean over all observations in the
universe of generalization for the object of measurement. If this universe
score were available, it would be a perfectly valid measure of the
dispositional attribute. However, the universe score is generally not
available and samples of observations must be used to estimate it. The
primary requirement for measurement of an attribute is therefore that it
provide accurate estimation of the universe score for the attribute. This leads
to the following definition of dispositional validity.

     A measurement procedure is said to be valid for a dispositional
     attribute to the extent that it provides dependable estimates of
     the universe score for the universe of generalization defining
     the attribute.

The validity of a measurement procedure is an index of the accuracy of
inferences from a sample mean to the mean over the universe of generalization,
where accuracy is defined by the expected squared error or by a coefficient of
generalizability. Validity is a matter of degree, rather than an all-or-none
property, and depends on both the measurement procedure and the attribute.






The value of the generalizability coefficient depends on how thoroughly
the measurement procedure samples the universe of generalization, and this is
determined by the design of the measurement procedure and by the definition of
the attribute being measured. In particular, the more narrowly the universe
of generalization is conceived, the more dependable the measurements will be.
Cronbach et al. (1972, p. 352) point out that, "Investigators often choose
procedures for evaluating the reliability that implicitly define a universe
narrower than their substantive theory calls for. When they do so, they
underestimate the 'error' of measurement; that is, the error of
generalization."

In a sense, the definition of validity given above is similar to the
classical notion of criterion validity, with the universe score being taken as
the criterion. However, this similarity is relatively superficial. Unlike
most of the criteria used with criterion validity, the universe score is an
abstraction, a parameter defined on a universe of observations. Since the
universe score is not directly observable, it isn't possible to estimate the
validity by correlating observed scores with universe scores.

Although the universe score is an abstraction, and therefore not directly
observable, it can be a relatively well-defined abstraction. To the extent
that the universe of generalization is clearly defined, the accuracy achieved
in estimating universe scores can be estimated by a generalizability
coefficient, and therefore the validity of the measurement procedure can be
represented by this coefficient. The clarity of definition of the universe of
generalization is an issue that would be considered under the heading of
content validity (Cronbach, 1971).

Therefore, if the operational definition of a dispositional term were
clearly specified, and if random samples could be drawn from the universe of
generalization associated with this definition, validation would be relatively
straightforward. Unfortunately, the universe of generalization is usually not
so clearly defined. This does not detract from the appropriateness of the
definition of validity given above, but it does impose some limitations on the
application of this definition. However, these limitations are not new; they
are closely related to the general problem of induction which arises in all
scientific research. Establishing the validity of a measurement procedure
requires the empirical verification of a number of invariance properties, and
this task is not necessarily simpler than the verification of other
empirical laws. The problem of induction that arises in verifying scientific
laws and some of the solutions that have been proposed will be discussed more
fully in a subsequent section.

The fact that a generalizability coefficient can be an index of validity
may be surprising, since generalizability theory is usually seen as an extension
of reliability theory. However, the interpretation of Eq(4.12) as a validity
coefficient is achieved only by making the strong sampling assumption that the
observed scores are based on random samples from the universe of generalization
(Tryon, 1957; McDonald, 1978). For most measurement procedures, observations
are generalized to universes of generalization that are much broader than the
universes from which they are sampled. It is not unusual, for example, for
inferences to be drawn about broadly defined universes of behaviors on the
basis of responses to a particular type of written test item. Similarly, a
series of weight measurements may be obtained with a particular spring. In
neither of these examples is it reasonable to assume that the observations are
a random sample from the universe of generalization for the attribute being
measured, and therefore a generalizability coefficient based on these
observations would not be a validity coefficient.

Therefore, the simple universe sampling model presented in this section
does not provide an adequate analysis of the validity of the great majority of
measurement procedures, which do not consist of random samples from the
intended universe of generalization. For most of the attributes that are of
interest in the behavioral sciences, standardization is used to control errors
of measurement, which are often unacceptably large when observations are
randomly sampled from the universe of generalization. Standardization involves
an explicit decision not to use random samples from the universe of
generalization in estimating universe scores. A standardized measurement
procedure samples observations from a universe which is a subuniverse of the
universe of generalization, and therefore requires a somewhat more
sophisticated model for validity than that presented in this section.

Another method for controlling errors is the use of stratified sampling
designs rather than simple random sampling. For example, in assessing the
dependability of generalizations from the items on an achievement test to a
universe of items, the assumption that items are randomly sampled within strata
is undoubtedly much more realistic than the assumption of simple random
sampling. (Stratified sampling is discussed by Rajaratnam, Cronbach, and
Gleser, 1965.)

In general, an analysis of the dependability of a measurement procedure
should reflect the sampling design used in the measurement procedure, and
generalizability theory makes it possible to do this in a systematic way.
Although the more realistic sampling models add complexity to generalizability
analyses and may cause problems in estimation, the analysis of these more
elaborate sampling designs is often especially informative; in a later section,
convergent validity will be shown to be equivalent to a generalizability
analysis with standardization of the method of observation.

Also, it is typically the case that there are unintended violations of
sampling assumptions in the G study. The effects of departures from the random
sampling assumptions cannot be estimated accurately, and therefore the
interpretation of the results of G studies must always be somewhat tentative.
The violation of sampling assumptions is, of course, a general problem in
research, and the clouding of interpretations that results from such violations
isn't unique to generalizability theory.

However, sampling problems are especially acute in the interpretation of
generalizability coefficients because the estimation of these coefficients
requires sampling from both a population of objects of measurement and a
universe of generalization. Although the population, consisting as it does of
the objects that are of primary interest to the researcher, is likely to be as
well defined as other populations investigated, universes of
generalization don't even meet this rather loose standard. The universe of
generalization, which, by definition, consists of facets that are not being
systematically investigated, is likely to be more poorly defined than the
population.

Unintended violations of the sampling assumptions may introduce bias into
the samples of some of the facets being investigated in a G study, and, given
the universal applicability of Murphy's law, it would be unrealistic to assume
that the estimates of variance components will be robust against such sampling
biases. These considerations suggest, of course, that every effort should be
made to avoid violations of the sampling assumptions. However, it would also
seem prudent to include some explicit recognition of the possibility of
sampling bias in the interpretations of generalizability coefficients. In
the last section of this paper, it will be shown that if generalizability
analyses are interpreted as tests of assumptions about invariance properties,
it is possible to make these interpretations less vulnerable to violations of
the sampling assumptions; the price to be paid for this increased security is
a weakening of the conclusions drawn from various studies.




V. Standardization and the Universe of Allowable Observations:

One Way to Refine the Sampling Model

As indicated earlier, the introduction of an explicit theory of errors makes
it possible for relatively simple theories to provide a consistent account of
a wide range of observations. The inconsistency that would otherwise arise in
these simple theories because of violations of invariance properties is
accounted for by errors of measurement. Furthermore, the magnitude of these
errors can be estimated, and the effects of such errors can therefore be taken
into account in interpreting current observations and in predicting future
observations.

Although the introduction of a theory of errors has several advantages,
these advantages are most pronounced when the errors involved are small. The
smaller the error variance, the more accurate the inferences that can be drawn
from one observation to another or from an observation to the universe of
generalization. It is desirable, therefore, that the error variance be as small
as possible.

There are three ways to decrease the error variance and therefore to
increase the precision of measurement. The first way to decrease the error
variance is to base each measurement on a larger sample of observations from
the universe of generalization. This approach is widely used in both the
physical and behavioral sciences, and is discussed in detail by Cronbach
et al. (1972). One advantage of generalizability theory is that it indicates
how to obtain the greatest increase in precision for a given increase in the
number of observations sampled.

A second way to reduce errors is to restrict the universe of generalization.
The more narrowly the universe of generalization is defined, the smaller
the errors will be. In the limiting case, if an observation is not generalized
to any wider universe, but is interpreted as an observation, there is no error
of measurement. Although narrowing the universe of generalization decreases
the error variance, it can also limit the usefulness of the measurements, and
is, therefore, not a panacea. This approach is discussed in the next section.

The third method for controlling errors of measurement is to standardize
the measurement procedure. Standardization can be very effective in reducing
errors of measurement, but it can also be misleading, and therefore requires
careful examination. The remainder of this section is devoted to the
implications of standardization.

STANDARDIZATION OF MEASUREMENT PROCEDURES

Since errors of measurement result from variations in the conditions of
observation, these errors may be reduced by controlling, or standardizing, the
conditions of observation. If the observations on an object of measurement
vary as some facet varies, these observations may be made more consistent by
making all observations with the same condition of the facet. If all
applications of a measurement procedure employ a particular condition, or set
of conditions, of a facet, the measurement procedure is said to be standardized
on the facet.

Standardization of the i facet changes the design of the measurement
procedure so that the objects of measurement are crossed with the same
conditions, I*, of the i facet for all measurements, but it doesn't alter the
definition of the attribute. Standardization of a measurement procedure is
not intended to imply a change in the universe of generalization, which
continues to include the full universe of conditions for the i facet.
Therefore, the universe score for object, o, is still $\mu_o$, as given by
Eq(4.5).



The observed score for a measurement procedure with the i facet
standardized can be represented by:

$$X_{oI^*R} = \mu + \alpha_o + \alpha_{I^*} + \alpha_{oI^*} + \alpha_R \tag{5.1}$$

The expected value of the observed score over repeated applications of the
standardized measurement procedure is given by:

$$E_R (X_{oI^*R}) = \mu_o + \alpha_{I^*} + \alpha_{oI^*} \tag{5.2}$$

For a standardized measurement procedure, therefore, the observed score is a
biased estimate of the universe score, unless the last two terms in Eq(5.2)
happen to be zero. This bias also appears in the errors for point estimates
of universe scores, given by:

$$D_{oI^*R} = X_{oI^*R} - \mu_o = \alpha_{I^*} + \alpha_{oI^*} + \alpha_R \tag{5.3}$$

Eq(5.3) is the same as Eq(4.7) except for the fact that in Eq(5.3), the first
term is a constant for all observations and the second term is a constant for
all observations on a particular object of measurement; in Eq(4.7), all three
terms are random variables. The expected value of the error, $D_{oI^*R}$, over
repeated observations on the object, o, is given by:

$$E_R (D_{oI^*R}) = \alpha_{I^*} + \alpha_{oI^*} \tag{5.4}$$



The amount of bias in the estimation of universe scores is indicated by the
two terms on the right side of Eq(5.4).

The expected squared error for object, o, is given by

$$E_R (D_{oI^*R}^2) = (\alpha_{I^*} + \alpha_{oI^*})^2 + \sigma^2(R) \tag{5.5}$$

Notice that the first term in Eq(5.5) involves the sum of two constants rather
than a variance component, and that the second term is the variance component
for replications.

The expected value of Eq(5.5) over the i facet is the same as the expected
value of the squared error, $D_{oIR}^2$, for the unstandardized procedure, given by
Eq(4.8). Therefore, standardization on a randomly chosen set of conditions of
a facet does not decrease the expected squared error for point estimates of
universe scores.
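
This equivalence can be verified on a toy example by averaging the standardized squared error over every equally likely choice of the standardized condition. The sketch assumes a single standardized condition (n_i = 1) and small, hypothetical sets of effect values that average to zero, as the model requires; the function names are illustrative.

```python
from itertools import product


def standardized_squared_error(alpha_i_star, alpha_oi_star, var_R):
    """Expected squared error for one object with the i facet standardized:
    the squared sum of two constants plus the replication variance."""
    return (alpha_i_star + alpha_oi_star) ** 2 + var_R


def population_variance(values):
    """Variance of equally likely values (divisor is the number of values)."""
    centre = sum(values) / len(values)
    return sum((v - centre) ** 2 for v in values) / len(values)


# Hypothetical, equally likely effects. Every pairing is equally likely,
# so the main effect and the interaction are uncorrelated.
main_effects = [-1.0, 0.0, 1.0]
interaction_effects = [-0.5, 0.5]
var_R = 0.25

pairs = list(product(main_effects, interaction_effects))
average_standardized = sum(
    standardized_squared_error(a, b, var_R) for a, b in pairs
) / len(pairs)
unstandardized = (
    population_variance(main_effects)
    + population_variance(interaction_effects)
    + var_R
)
print(average_standardized, unstandardized)
```

The two printed values agree, which is the sense in which standardizing on a randomly chosen condition gives no gain for point estimates.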

If I* can be chosen so that $(\alpha_{I^*} + \alpha_{oI^*})^2$ is small compared to the
sum of the first two variance components in Eq(4.8), the expected squared
error for the standardized measurement procedure will be smaller than the
expected squared error for the unstandardized procedure. A biased estimate
with a small variance is often more useful than an unbiased estimate with a
large variance. However, the problems of estimation involved in choosing a
"good" value for I* are substantial (see Cronbach et al., 1972, p. 101).






It may happen that there is no choice of I* that significantly reduces the
squared error, and if an I* can be found with a small value for $\alpha_{I^*}$, this
choice may involve an unacceptably large value for $\alpha_{oI^*}^2$. Another
possibility is to estimate the value of $\alpha_{I^*}$ by "calibrating" conditions of
the i facet, and subtracting the estimated value of $\alpha_{I^*}$ from all observed
scores; however, this is equivalent to using regression estimates with a slope
of 1.0, and if this approach is to be used at all, it would probably be better
to use standard regression estimates for the observed scores. The use of
regression estimates of universe scores introduces a third type of error
(Cronbach et al., 1972, p. 106-107), which is not discussed in this paper.

Therefore, standardization may decrease the error for point estimates of
universe scores, but does not necessarily do so. Standardization is a much
more promising approach when observed scores are used to estimate universe
scores relative to the average universe score in the population. If all
observations have I* as the conditions of the i facet, the expected value of
the observed score over the population is:

$$\mu_{I^*} = \mu + \alpha_{I^*}$$

and, therefore,

$$d_{oI^*R} = (X_{oI^*R} - \mu_{I^*}) - (\mu_o - \mu) = \alpha_{oI^*} + \alpha_R \tag{5.6}$$

The main effect, $\alpha_{I^*}$, does not appear in Eq(5.6) because it is a constant
for all observed scores, and therefore has no effect on the differences
between observed scores and the mean observed score.

The expected value of $d_{oI^*R}$ over repeated applications of the
standardized measurement procedure is given by:

$$E_R (d_{oI^*R}) = \alpha_{oI^*} \tag{5.7}$$

Therefore, the standardized measurement procedure -is also, biased in its 
estimates of universe deviations scores, but the magnitude of th^e bias, 
consisting only of the interaction effect, aoi*i is smaller than it is for 
point estimates of universe scores. Note that the expected value of this 
specif ic bias, over .the population^ ^is zero.. Unless aol* is zero for all 
objects df measuremerit, therefdre, universe .deviation scores^ are' systematically 
overestimated for sdme dbjects df measurements arid are systematically 
underestimated for dthers. " • 

The expected value of the squared error over replications for object $o$ is:

$$\mathcal{E}_R(\delta_{oi^*R}^2) = \alpha_{oi^*}^2 + \sigma^2(R) \qquad (5.8)$$

The expected value of Eq(5.8) over all possible choices of $i^*$ is:

$$\mathcal{E}_I \mathcal{E}_R(\delta_{oi^*R}^2) = \sigma^2(oi) + \sigma^2(R) \qquad (5.9)$$

Therefore, the squared error, $\delta_{oi^*R}^2$, is expected to be smaller
for the standardized measurement procedure than for the unstandardized
measurement procedure, given by Eq(4.10), even if standardization is based on
randomly chosen conditions of the $i$ facet. Furthermore, if an $i^*$ is
available for which the specific systematic error is particularly small, it
is possible by standardizing the $i$ facet to obtain a procedure with a small
bias and with an expected squared error that is much smaller than that of the
unstandardized measurement procedure. (Again, sampling problems make it very
difficult to make an optimal choice for $i^*$.)

The main advantage in standardization is that it can be used to reduce
error variance. In practice, standardization of a facet is most useful when
observed scores are used to estimate universe deviation scores and the
variance for the main effect of the $i$ facet, $\sigma^2(i)$, is relatively
large. Standardization of the $i$ facet automatically eliminates
$\sigma^2(i)$ from the error variance for comparative decisions, and if
regression estimates are feasible, can eliminate $\sigma^2(i)$ from the error
variance for point estimates. Although it is sometimes possible to choose
conditions for the $i$ facet so that the expected value of $\alpha_{oi^*}^2$
over the population is small, this goal is not easy to achieve.
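
As a minimal numerical sketch of this point (not part of the original
analysis), the following Python fragment compares the error variance for
comparative decisions with and without standardization. It assumes,
consistent with the statement above, that the unstandardized error variance
is $\sigma^2(i) + \sigma^2(oi) + \sigma^2(R)$ and that standardization
removes only the $\sigma^2(i)$ term. The function name and the numerical
values are hypothetical.

```python
def comparative_error_variance(var_i, var_oi, var_R, standardized):
    """Error variance for deviation (comparative) scores.

    Assumption: without standardization every object gets its own randomly
    sampled condition of the i facet, so var_i adds to the error variance.
    """
    if standardized:
        # alpha_i* is the same constant for every object and drops out
        return var_oi + var_R
    return var_i + var_oi + var_R


# Hypothetical variance components, chosen only for illustration.
components = dict(var_i=0.40, var_oi=0.10, var_R=0.25)

unstandardized = comparative_error_variance(standardized=False, **components)
standardized = comparative_error_variance(standardized=True, **components)
reduction = unstandardized - standardized  # equals var_i

print(f"unstandardized: {unstandardized:.2f}")
print(f"standardized:   {standardized:.2f}")
print(f"reduction:      {reduction:.2f}")
```

The reduction equals the main-effect variance, so the gain from
standardization grows with $\sigma^2(i)$ and vanishes when the $i$ facet has
no main effect.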

SYSTEMATIC ERRORS

Standardization is a powerful tool for reducing the magnitude of errors of
measurement. As indicated above, however, the realization of benefits from
this technique may require judicious selection of the conditions, $i^*$,
chosen for standardization, and this is not a trivial problem. Furthermore,
there is a price to be paid for the benefits of standardization.

If the condition of the $i$ facet is the same for all observations, the
effects, $\alpha_{i^*}$ and $\alpha_{oi^*}$, are constants over replications
of the measurement procedure. For a given object of measurement, therefore,
standardization changes some components of the error of measurement from
random variables to constants. Components of the error that are constant for
all observations on an object of measurement are called systematic errors.
The effect, $\alpha_{i^*}$, is a general systematic error since it is a
constant over all observations. The interaction effect, $\alpha_{oi^*}$,
which is a constant for each object of measurement but may vary from one
object of measurement to another, is a specific systematic error.

The systematic errors have neither of the two defining properties of
random errors. First, the expected value of the systematic errors over
repeated application of the standardized measurement procedure is not zero.
Therefore, systematic errors introduce bias into estimates of the universe
scores. The main effect for the $i$ facet is the same for all objects of
measurement, and represents a general bias, which is present in $\Delta$ but
not in $\delta$. The interaction effect, $\alpha_{oi^*}$, is a specific bias
for each object of measurement, $o$, and it affects both $\Delta$ and
$\delta$. Since the systematic errors are constant for each object of
measurement, they do not tend to "cancel out" over a series of observations.

Second, the systematic errors are correlated across independent
administrations of the measurement procedure. Since the expected value of
Eq(5.4) over the population is $\alpha_{i^*}$, the covariance between the
errors, $\Delta$, on two independent administrations of the standardized
measurement procedure is given by (Lord and Novick, p. 181):

$$\mathrm{cov}(\Delta_{oi^*R}, \Delta_{oi^*R'}) = \sigma^2(oi^*) \qquad (5.10)$$






Thus, the errors of measurement for the standardized measurement procedure
are correlated and the magnitude of the correlation depends on the magnitude
of the specific systematic errors.

Similarly, since the expected value of Eq(5.7) over the population is zero,
the covariance between the errors, $\delta$, on two independent
administrations of the standardized measurement procedure is given by:

$$\mathrm{cov}(\delta_{oi^*R}, \delta_{oi^*R'}) = \sigma^2(oi^*) \qquad (5.11)$$

Since the systematic errors are correlated across observations and do not
have a mean of zero, they cannot be interpreted as the kind of random errors
that appear in reliability coefficients. The interpretation of systematic
errors raises issues usually associated with validity.
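
The covariance result in Eq(5.11) can be checked with a small simulation.
The sketch below is illustrative only: it assumes normally distributed
effects, uses hypothetical variance components, and relies solely on the
Python standard library. Each simulated object keeps the same specific
systematic error on both administrations, while the random error is redrawn.

```python
import random


def error_covariance(var_oi, var_R, n_objects=20000, seed=0):
    """Sample covariance of deviation-score errors on two administrations."""
    rng = random.Random(seed)
    sd_oi, sd_R = var_oi ** 0.5, var_R ** 0.5
    first, second = [], []
    for _ in range(n_objects):
        a_oi = rng.gauss(0.0, sd_oi)  # specific systematic error, fixed per object
        first.append(a_oi + rng.gauss(0.0, sd_R))   # error, administration 1
        second.append(a_oi + rng.gauss(0.0, sd_R))  # error, administration 2
    mean_1 = sum(first) / n_objects
    mean_2 = sum(second) / n_objects
    cross = sum((x - mean_1) * (y - mean_2) for x, y in zip(first, second))
    return cross / (n_objects - 1)


cov = error_covariance(var_oi=0.5, var_R=1.0)
print(f"simulated covariance: {cov:.3f} (Eq(5.11) gives 0.5)")
```

The simulated covariance stays close to the interaction variance rather than
to zero, which is what distinguishes these errors from random errors.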

THE UNIVERSE OF ALLOWABLE OBSERVATIONS

In standardizing the $i$ facet by requiring that every measurement involve
the condition, $i^*$, a new kind of universe, the universe of allowable
observations, is introduced. The universe of allowable observations is a
subset of the universe of generalization, and consists of those observations
in the universe of generalization that have the appropriate condition for
each standardized facet. An instance of the standardized measurement
procedure is a randomly sampled observation from the universe of allowable
observations. By contrast, an instance of the unstandardized measurement
procedure is an observation randomly sampled from the full universe of
generalization. The universe of allowable observations defines a measurement
procedure in the same way that the universe of generalization defines an
attribute; both are "extensive" definitions.

Since standardization does not change the universe of generalization,
measurements involve inferences to the universe of generalization rather than
the universe of allowable observations. However, because it is easier to
sample from the universe of allowable observations, an investigator who is
evaluating a standardized measurement procedure will often begin by examining
the dependability of inferences from observed scores to the mean over the
universe of allowable observations. To do this, a G study can be conducted
with observations randomly sampled from the universe of allowable
observations.

If the $i$ facet is standardized as in the universe of allowable
observations, all observations in the G study involve $i^*$, and the $i$
facet is a "hidden facet". The effects involving the $i$ facet are confounded
with other effects and cannot be estimated independently. An observed score
can be written as

$$X_{oi^*R} = (\mu + \alpha_{i^*}) + (\alpha_o + \alpha_{oi^*}) + \alpha_R \qquad (5.12)$$

where R is a replication index representing the combined effects of all
facets other than the $i$ facet. The terms enclosed in parentheses in
Eq(5.12) are completely confounded, and, in analyzing the G study, the model
equation may be taken as:



Two variance components would be generated by a G study with the $i$ facet
fixed as $i^*$, and these variance components can be written as:

$$\sigma^2(o') = \sigma^2(o) + \sigma^2(oi^*) \qquad (5.15a)$$

$$\sigma^2(R') = \sigma^2(R) \qquad (5.15b)$$

where the variance components on the right side of Eq(5.15) are those that
could be estimated in a G study in which conditions of the $i$ facet are
randomly sampled from the universe of generalization and are crossed with the
objects of measurement.

The investigator who conducts a G study in which all observations are
sampled from the universe of allowable observations will estimate the
universe score variance by Eq(5.15a), and the expected observed score
variance by:

$$\mathcal{E}\sigma^2(X_{oi^*R}) = \sigma^2(o) + \sigma^2(oi^*) + \sigma^2(R) \qquad (5.16)$$

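Using Eqs(5.15a) and (5.16), the generalizability coefficient would be
estimated as:

$$\mathcal{E}\rho^2(X_{oi^*R}, \mu_o^*) = \frac{\sigma^2(o')}{\sigma^2(o') + \sigma^2(R')} = \frac{\sigma^2(o) + \sigma^2(oi^*)}{\sigma^2(o) + \sigma^2(oi^*) + \sigma^2(R)} \qquad (5.17)$$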



Eq(5.17) assumes that generalization is over replications, and not over the
$i$ facet, and indicates the dependability of inferences from observed scores
to the mean over the universe of allowable observations, $\mu_o^*$, in
Eq(5.2). The dependability of inferences to the universe score is given by
the ratio of the universe score variance in Eq(4.5) to the observed score
variance for the standardized measurement procedure, given by Eq(5.16):

$$\mathcal{E}\rho^2(X_{oi^*R}, \mu_o) = \frac{\sigma^2(o)}{\sigma^2(o) + \sigma^2(oi^*) + \sigma^2(R)} \qquad (5.18)$$

Eq(5.17) is approximately equal to the expected correlation, over the
population, of two independent administrations of the standardized
measurement procedure. Since Eq(5.17) indicates the consistency among the
observed scores derived from the standardized measurement procedure, it can
be interpreted as a reliability coefficient. Since Eq(5.18) reflects the
agreement between observed scores and the value of the attribute as given by
$\mu_o$, it can be interpreted as a validity coefficient.
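
To make the two coefficients concrete, the following sketch evaluates
Eq(5.17) and Eq(5.18) for one set of variance components. The function names
and the numerical values are hypothetical; the two coefficients share a
denominator and differ only in whether $\sigma^2(oi^*)$ is counted in the
numerator.

```python
def reliability(var_o, var_oi_star, var_R):
    """Eq(5.17): generalization to the mean over allowable observations."""
    return (var_o + var_oi_star) / (var_o + var_oi_star + var_R)


def validity(var_o, var_oi_star, var_R):
    """Eq(5.18): generalization to the universe score."""
    return var_o / (var_o + var_oi_star + var_R)


# Hypothetical variance components.
var_o, var_oi_star, var_R = 0.6, 0.2, 0.2

rel = reliability(var_o, var_oi_star, var_R)
val = validity(var_o, var_oi_star, var_R)
print(f"reliability = {rel:.2f}, validity = {val:.2f}")
```

With these values the specific systematic error accounts for the entire gap
between the two coefficients.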



RANDOM ERRORS AND RELIABILITY

The question of validity has been taken to be equivalent to the question of
how well the results of a measurement procedure estimate the universe score;
this question is answered by determining how well the results of the
measurement procedure satisfy the invariance properties implicit in the
operational definition of the dispositional attribute. Because
standardization is so commonly employed in designing measurement procedures,
however, the operational definition of the measurement procedure is usually
not the same as the operational definition of the attribute; the observations
that are actually used to estimate universe scores are typically drawn from a
universe of allowable observations which is a sub-universe of the universe of
generalization. A natural question to ask, then, is how well the results of
particular instances of the measurement procedure generalize to the universe
of allowable observations. This question is equivalent to the question of how
well repeated administrations of the testing procedure (i.e., repeated
samples of observations from the universe of allowable observations) agree
with each other. This issue is usually treated under the heading of
reliability. Within the sampling model, reliability is defined in terms of
the universe of allowable observations:

A measurement procedure is reliable to the extent that its observed scores
provide dependable estimates of the mean over the universe of allowable
observations.

Note that reliability is defined as a property of a measurement procedure,
and does not depend on the definition of the attribute. As noted earlier,
dispositional validity depends on both the measurement procedure and the
attribute being measured. This distinction is consistent with the traditional
definitions of reliability and validity. Reliability provides an index of
consistency among the scores from independent administrations of a
measurement procedure, and validity indicates the relationship between the
results of a measurement procedure and an interpretation of these results
which goes beyond the definition of the measurement procedure.

The reliability of a measurement procedure is an index of the consistency
among the observed scores in the universe of allowable observations. This
definition is equivalent to the definition of reliability for randomly
parallel tests if the "true score" is equated with the mean over the universe
of allowable observations. The reliability of a measurement procedure is
limited by the magnitude of the random errors only; specific systematic
errors tend to increase the reliability.

If the $i$ facet is standardized, the interaction effect, $\alpha_{oi^*}$,
is a systematic error, and is included in the numerator of the reliability
coefficient in Eq(5.17). Since variance components are positive,

$$\mathcal{E}\rho^2(X_{oi^*R}, \mu_o^*) \geq \mathcal{E}\rho^2(X_{oi^*R}, \mu_o) \qquad (5.19)$$

$\mathcal{E}\rho^2(X_{oi^*R}, \mu_o^*)$, which indicates the dependability of
inferences from observations to the mean over the universe of allowable
observations, is a reliability coefficient, while
$\mathcal{E}\rho^2(X_{oi^*R}, \mu_o)$, which indicates the dependability of
inferences from observations to the mean over the universe of
generalization, is a validity coefficient. Therefore, the inequality in
Eq(5.19) restates the well-known result from classical test theory that
reliability is an upper bound for validity. For the sampling model, this
result can be interpreted as reflecting the fact that generalization to the
universe of allowable observations is always at least as dependable as
generalization to the more broadly defined universe of generalization.



Although a G study with the $i$ facet hidden does not address the central
issue of the dependability of inferences from observed scores to universe
scores, it is useful for two reasons. First, it provides an upper bound on
the validity of the measurement procedure. A reasonably high value for the
reliability coefficient, $\mathcal{E}\rho^2(X_{oi^*R}, \mu_o^*)$, does not
establish the measurement procedure as having a high validity, but a low
value can establish that the procedure has a low validity. In a subsequent
section, the importance of such one-sided tests will be discussed. Second,
the G study with the $i$ facet fixed does provide an estimate of the random
error variance, $\sigma^2(R)$, which is needed for estimating the validity
coefficient, $\mathcal{E}\rho^2(X_{oi^*R}, \mu_o)$. If $\sigma^2(oi)$ is
estimated in a subsequent G study, the validity coefficient could be
estimated directly.

Of course, if the universe of generalization had only two facets, the
investigator could estimate the variance components for both facets in a
single study. The need for more than one G study arises from the fact that
most universes of generalization have many facets, and only a few facets can
be systematically investigated in any G study. Since large sample sizes are
generally necessary for the accurate estimation of variance components in
designs with as few as two facets (see Smith, 1978), an adequate analysis of
the generalizability of a measurement procedure will usually require a number
of G studies.

SYSTEMATIC ERRORS AND VALIDITY

The difference between Eq(5.18), which has been interpreted as a validity
coefficient, and Eq(5.17), which has been interpreted as a reliability
coefficient, is in the role played by $\sigma^2(oi^*)$. A reliability
coefficient indicates the consistency of observed scores from one
administration of a measurement procedure to another. As indicated by
Eq(5.11), $\sigma^2(oi^*)$ is the covariance of the errors of measurement
over repeated observations on the object, $o$. Therefore, the covariance
between the observed scores on two independent administrations of the
standardized measurement procedure (two independent samples from the universe
of allowable observations) increases as $\sigma^2(oi^*)$ increases.
Therefore, as the magnitude of the specific systematic errors increases, the
reliability increases. The magnitude of $\sigma^2(oi^*)$ provides information
about how well the observations drawn from the universe of allowable
observations correlate with the universe score for the attribute being
measured. This would usually be interpreted as a question of validity.

In classical test theory, the "true score" for an object of measurement is
defined as the expected value of the observed score over repeated application
of a measurement procedure to the object of measurement. For a standardized
measurement procedure, the expected value over repeated application of the
measurement procedure implies taking an expected value over R, but not over
$i$. Therefore, the mean over the universe of allowable observations is
analogous to the true score of classical test theory.

Although the standardized measurement procedure produces biased estimates of
the mean over the universe of generalization, it does provide unbiased
estimates of the mean over the universe of allowable observations. In the
limit as the number of replications approaches infinity, the magnitude of the
random errors approaches zero, and the observed score approaches $\mu_o^*$.
Therefore $\mu_o^*$ is a parameter for which the measurement procedure
provides unbiased estimates. Since the measurement procedure is intended to
provide estimates of the universe score, $\mu_o$, the correlation between
$\mu_o^*$ and $\mu_o$ provides an index of the agreement between what the
procedure actually estimates without bias and what it is intended to measure.
The squared correlation between $\mu_o^*$ and $\mu_o$ is given by:

$$\mathcal{E}\rho^2(\mu_o^*, \mu_o) = \frac{\sigma^2(o)}{\sigma^2(o) + \sigma^2(oi^*)} \qquad (5.20)$$

Eq(5.20) represents the squared correlation between the universe score and
an observed score for which the sampling of the universe of allowable
observations is sufficiently thorough that random errors can be ignored. In
addition, Eq(5.20) provides an upper bound on the validity of measurements
based on a standardized measurement procedure. This correlation can also be
obtained by taking the limit of Eq(5.18) as the number of replications
approaches infinity, and therefore equals the limit of the squared
correlation between the observed score and the universe score as the sample
size for the observed score approaches infinity. For an observed score based
on the average of a finite number of observations from the universe of
allowable observations, the squared correlation between the observed score
and the universe score will be less than or equal to Eq(5.20).

Eq(5.20) can be interpreted as a validity coefficient corrected for
attenuation, and can be represented in a form analogous to classical
attenuation formulas:

$$\mathcal{E}\rho^2(\mu_o^*, \mu_o) = \frac{\mathcal{E}\rho^2(X_{oi^*R}, \mu_o)}{\mathcal{E}\rho^2(X_{oi^*R}, \mu_o^*)} \qquad (5.21)$$

where the numerator of Eq(5.21) is the validity coefficient given by
Eq(5.18), and the denominator is the reliability coefficient given by
Eq(5.17). Therefore, $\mathcal{E}\rho^2(\mu_o^*, \mu_o)$ represents a
disattenuated validity coefficient for the standardized measurement
procedure.
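
The correction for attenuation can be verified numerically. The sketch below
is illustrative only, with hypothetical variance components and a
hypothetical helper function: it divides the validity coefficient of
Eq(5.18) by the reliability coefficient of Eq(5.17) and compares the result
with the ratio $\sigma^2(o) / [\sigma^2(o) + \sigma^2(oi^*)]$, which is what
Eq(5.18) reduces to when the random error variance is zero.

```python
def disattenuated_validity(validity, reliability):
    """Eq(5.21): validity coefficient divided by reliability coefficient."""
    if reliability <= 0:
        raise ValueError("reliability must be positive")
    return validity / reliability


# Hypothetical variance components.
var_o, var_oi_star, var_R = 0.6, 0.2, 0.2
total = var_o + var_oi_star + var_R

validity = var_o / total                     # Eq(5.18)
reliability = (var_o + var_oi_star) / total  # Eq(5.17)

corrected = disattenuated_validity(validity, reliability)
direct = var_o / (var_o + var_oi_star)       # Eq(5.18) with var_R set to zero
print(f"corrected: {corrected:.4f}   direct: {direct:.4f}")
```

The random error variance cancels in the ratio, which is why the corrected
coefficient does not depend on the number of replications.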

THE RELIABILITY-VALIDITY PARADOX

The inference from the observed score to the universe score can be
decomposed into two parts. The first part is an inference from the observed
score to the mean over the universe of allowable observations, and the second
part is an inference from the mean over the universe of allowable
observations to the mean over the universe of generalization, the universe
score. The coefficient for inferences from observed scores to universe scores
can be factored to represent the separate contributions of these two
inferences:

$$\mathcal{E}\rho^2(X_{oI^*R}, \mu_o) = \frac{\sigma^2(o) + \sigma^2(oi)/n_i}{\sigma^2(o) + \sigma^2(oi)/n_i + \sigma^2(r)/n_i n_r} \cdot \frac{\sigma^2(o)}{\sigma^2(o) + \sigma^2(oi)/n_i} \qquad (5.22)$$

The first factor on the right side of Eq(5.22) is the reliability of the
standardized measurement procedure and represents the dependability of
inferences from observed scores to the expected value over the universe of
allowable observations, $\mu_o^*$. The second factor on the right side of
Eq(5.22) is a disattenuated validity coefficient and represents the
dependability of inferences from $\mu_o^*$ to $\mu_o$.

Assuming that the total number of observations, $n_i n_r$, is to be kept
constant, the value of Eq(5.22) is maximized by maximizing $n_i$. The
systematic error variance, $\sigma^2(oi)/n_i$, is inversely proportional to
$n_i$, while the random error variance, $\sigma^2(R) = \sigma^2(r)/n_i n_r$,
is constant as long as the total number of observations doesn't change. The
validity of the standardized measurement procedure is improved by decreasing
the sampling variance for the $oi$ interaction effect, and this sampling
variance is decreased by increasing the sample size for the $i$ facet.



However, attending only to the reliability of the measurement procedure
leads to the opposite conclusion. The reliability coefficient, given by the
first factor in Eq(5.22), is maximized by setting $n_i$ equal to one, because
this maximizes the $oi$ interaction variance, $\sigma^2(oi)/n_i$, which is
included in the numerator of the reliability coefficient. However, this
minimizes the disattenuated validity coefficient, which constitutes the
second factor in Eq(5.22), and thereby minimizes the overall validity.
Therefore, attempts to increase the reliability of the measurement procedure
by standardizing a facet may decrease the validity. This phenomenon has been
called the reliability-validity paradox. A closely related phenomenon, called
the attenuation paradox, is discussed by Loevinger (1954) and more recently
by Lord and Novick (1968, p. 334).
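
The paradox can be illustrated by tabulating the two factors of Eq(5.22)
while the total number of observations is held fixed. The sketch below is a
minimal illustration under assumed, hypothetical variance components and a
hypothetical budget of 12 observations; the function name is not from the
source.

```python
def factor_coefficients(var_o, var_oi, var_r, n_i, n_total):
    """Return (reliability, disattenuated validity, validity) per Eq(5.22)."""
    n_r = n_total / n_i                 # replications per standardized condition
    systematic = var_oi / n_i           # sigma^2(oi)/n_i
    random_error = var_r / (n_i * n_r)  # equals var_r / n_total for every n_i
    rel = (var_o + systematic) / (var_o + systematic + random_error)
    disattenuated = var_o / (var_o + systematic)
    return rel, disattenuated, rel * disattenuated


# Hypothetical variance components and a fixed budget of 12 observations.
rows = []
for n_i in (1, 2, 3, 4, 6, 12):
    rel, dis, val = factor_coefficients(1.0, 0.5, 2.0, n_i, n_total=12)
    rows.append((n_i, rel, dis, val))
    print(f"n_i={n_i:2d}  reliability={rel:.3f}  "
          f"disattenuated={dis:.3f}  validity={val:.3f}")

n_i_max_reliability = max(rows, key=lambda row: row[1])[0]
n_i_max_validity = max(rows, key=lambda row: row[3])[0]
```

In this example the reliability is largest when a single condition is used,
while the overall validity is largest when every observation uses a different
condition, so the two criteria point to opposite designs.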

CONVERGENT VALIDITY

Of the three main types of validity (construct, criterion, and content),
dispositional validity is most similar to construct validity and may even be
considered a part of construct validity. The definition of an attribute
implies invariance properties. These invariance properties are laws that can
be tested in G studies. If all of the invariance properties are tested and
verified, the measurement procedure is valid. If some of the invariance
properties are tested and no violations are detected, the validity of the
procedure is partially supported. If even one invariance property is
seriously violated, then the procedure is invalid, and either the measurement
procedure or the interpretation of the attribute must be revised.

The connection between invariance properties and construct validity can be
made more explicit by examining how convergent validity (Campbell and Fiske,
1959) can be applied to dispositional attributes. Convergent validity of a
measurement procedure can be investigated by letting the conditions of $i$
represent different types of observations (objective tests, ratings,
observation procedures, etc.). For each measurement a single condition from
the $i$ facet is sampled for all objects of measurement, and $n_r$
replications are sampled independently for each object. The expected observed
score variance for these measurements is

$$V(X_{oiR}) = \sigma^2(o) + \sigma^2(oi) + \sigma^2(r)/n_r \qquad (5.23)$$

Since the universe score is defined as the expected value of the observed
score over both the $i$ facet and replications, the universe score variance
is given by $\sigma^2(o)$; the generalizability coefficient for this
measurement procedure is:

$$\mathcal{E}\rho^2(X_{oiR}, \mu_o) = \frac{\sigma^2(o)}{\sigma^2(o) + \sigma^2(oi) + \sigma^2(r)/n_r} \qquad (5.24)$$

The intraclass correlation coefficient in Eq(5.24) is approximately equal
to the expected value, taken over the population of objects and the universe
of methods and replications, of the correlation between two sets of scores
based on independently sampled methods. The interpretation of Eq(5.24) as a
validity coefficient depends on the interpretation of the attribute; that is,
it depends on the assumption that the attribute is not linked to a particular
method. The value of Eq(5.24) can be used to check on the accuracy of a
hypothesized invariance over methods.




Convergent validity is generally evaluated by measuring the same
characteristic by several different methods, and estimating the correlations,
taken over objects, between the scores obtained on the various methods. If
these correlations are high, then convergent validity is supported.
Conversely, if these correlations are low, convergent validity is not
supported. The generalizability coefficient in Eq(5.24) is approximately
equal to the expected value of the correlation between observed scores
obtained with pairs of methods randomly selected from the universe of
generalization. It therefore provides a measure of the average convergent
validity over all pairs of methods.
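
The approximate equality between Eq(5.24) and the correlation between scores
from two independently sampled methods can be illustrated by simulation. The
sketch below is not from the source: it assumes normally distributed effects
and hypothetical variance components, and uses only the Python standard
library.

```python
import random


def coefficient_5_24(var_o, var_oi, var_r, n_r):
    """Eq(5.24) computed from variance components."""
    return var_o / (var_o + var_oi + var_r / n_r)


def correlation_between_methods(var_o, var_oi, var_r, n_r,
                                n_objects=20000, seed=1):
    """Correlation, over objects, of scores from two independent methods."""
    rng = random.Random(seed)
    sd_o, sd_oi = var_o ** 0.5, var_oi ** 0.5
    sd_mean = (var_r / n_r) ** 0.5  # SD of the mean of n_r replications
    x, y = [], []
    for _ in range(n_objects):
        a_o = rng.gauss(0.0, sd_o)
        # The method main effect is constant over objects within a method,
        # so it cannot change the correlation and is omitted.
        x.append(a_o + rng.gauss(0.0, sd_oi) + rng.gauss(0.0, sd_mean))
        y.append(a_o + rng.gauss(0.0, sd_oi) + rng.gauss(0.0, sd_mean))
    mx, my = sum(x) / n_objects, sum(y) / n_objects
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


expected = coefficient_5_24(1.0, 0.5, 2.0, n_r=4)
simulated = correlation_between_methods(1.0, 0.5, 2.0, n_r=4)
print(f"Eq(5.24): {expected:.3f}   simulated: {simulated:.3f}")
```

Because only the object effect is shared between the two methods, the
object-method interaction lowers the correlation in the same way that random
error does.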



Assuming that $i$ represents the method facet, the systematic errors
$\alpha_{oi}$ are the object-method interactions. For a particular method
$i^*$, $\alpha_{oi^*}$ represents the specific systematic bias that results
from using method $i^*$. Since the expected value of $\alpha_{oi^*}$ taken
over all conditions of the $i$ facet is zero, $\sigma^2(oi)$ is the expected
value of $\alpha_{oi}^2$. A large value for $\sigma^2(oi)$ means that method
has a serious effect on measurement. If $\sigma^2(oi)$ is zero, method has no
influence on the measurement of deviation scores. Discriminant validity is
discussed in a subsequent section.






In measuring an attribute, $\mu_o$, for the objects of measurement, the
effects, $\alpha_i$ and $\alpha_{oi}$, are components of the error; the
larger these effects are, the more difficult it is to obtain dependable
measurements of $\mu_o$, and, therefore, the effects, $\alpha_i$ and
$\alpha_{oi}$, are generally viewed as nuisance factors to be reduced as much
as possible. As described in the last section, standardization provides one
way of dealing with these errors, but standardization introduces systematic
errors. Furthermore, the fact that observations have a strong dependence on
the $i$ facet should be of interest in itself, aside from its effect on the
inferences to $\mu_o$. Where this dependence can be described by an empirical
law, a powerful technique for controlling errors of measurement becomes
available.

The errors introduced into measurements of $\mu_o$ by the $i$ facet can
always be eliminated by shifting attention to a new attribute, $\mu_{oi}$,
which involves the same kind of operations that are used to define the
attribute, $\mu_o$, but which has as its objects of measurement the pairs,
$oi$, instead of the original objects of measurement, $o$. The universe
scores, $\mu_{oi}$, for these new objects of measurement are found by taking
the expected value of the observed scores over replications:

$$\mu_{oi} = \mathcal{E}_R(X_{oiR})$$

This redefinition of the objects of measurement changes $\alpha_i$ and
$\alpha_{oi}$ from being part of the error to being part of the universe
score. If the objects of measurement are defined by $i$ as well as $o$, the
difference between the two universe scores, $\mu_{oi}$ and $\mu_{oi'}$,
involving different conditions of the $i$ effect, is taken as a substantive
difference rather than as an error of measurement. Therefore, the two
components, $\alpha_i$ and $\alpha_{oi}$, whose sum equals the difference
between $\mu_{oi}$ and $\mu_o$, become components of the universe score.

Measurements of $\mu_{oi}$ are more dependable than measurements of
$\mu_o$, because the interpretation of $\mu_{oi}$ is narrower than that of
$\mu_o$ and thus involves inferences that are less susceptible to errors than
those implied by $\mu_o$. While the original attribute, $\mu_o$,
characterizes $o$ for all conditions of the $i$ facet, the new attribute,
$\mu_{oi}$, characterizes $o$ for a particular condition of the $i$ facet.

For each object of measurement, $oi$, the universe of generalization of the
attribute, $\mu_{oi}$, includes observations with different conditions of the
replication facet, but with constant values for $i$ and $o$. Therefore,
inferences from observed scores to $\mu_{oi}$ involve generalization over
$r$, but not over $i$. The universe of generalization for the attribute
$\mu_o$ includes observations with different values of both $i$ and $r$, and
inferences from observed scores to $\mu_o$ involve generalization over the
$i$ effect and the $r$ effect. (The generic term "effect" is used in this
section rather than the term "facet", because some of the objects of
measurement being considered are defined by a combination of a condition of
the $o$ effect and the $i$ effect. Therefore, neither the $i$ effect nor the
$o$ effect separately specifies the objects of measurement. However, it is
also true that the $i$ and $o$ effects are not facets of the universe of
generalization.)

If the observed score, $X_{oiR}$, is used to estimate $\mu_{oi}$, the
expected value over replications, the only source of error will be the
replication facet. Therefore, the dependability of inferences from $X_{oiR}$
to $\mu_{oi}$ is given by:

$$\mathcal{E}\rho^2(X_{oiR}, \mu_{oi}) = \frac{\sigma^2(o) + \sigma^2(oi) + \sigma^2(i)}{\sigma^2(o) + \sigma^2(oi) + \sigma^2(i) + \sigma^2(R)} \qquad (6.1)$$

The coefficient in Eq(6.1) is approximately equal to the expected value of
the squared correlation between the observed score, $X_{oiR}$, and the
universe score, $\mu_{oi}$.

When the universe of generalization is restricted to a particular condition
of the $i$ effect, $\mu_{oi}$ becomes the universe score, and Eq(6.1), which
reflects the dependability of inferences from $X_{oiR}$ to $\mu_{oi}$, is a
validity coefficient, with the $oi$ combinations as the objects of
measurement and generalization over R. This validity coefficient,
$\mathcal{E}\rho^2(X_{oiR}, \mu_{oi})$, is never less than the validity
coefficient, $\mathcal{E}\rho^2(X_{oiR}, \mu_o)$, with conditions of the $o$
effect as the objects of measurement, and generalization over $i$ and R.
Restricting the universe of generalization improves the validity of
measurement whenever $\sigma^2(oi)$ or $\sigma^2(i)$ is greater than zero.
(By contrast, for nonzero values of $\sigma^2(i)$, standardization of the $i$
facet always improves reliability but does not necessarily improve validity.)

The increase in validity obtained by restricting the universe of
generalization is part of a trade-off in which decreases in the errors of
measurement are obtained by narrowing the interpretation that can be given to
the attribute. A high value for Eq(6.1) indicates that an inference from the
observed score, $X_{oiR}$, to the universe score, $\mu_{oi}$, is dependable,
and therefore provides justification for such inferences. If Eq(6.1) is to be
taken as a validity coefficient, generalization must not go beyond the
restricted universe of generalization, which involves a particular value of
$i$; that is, inferences are to the universe score, $\mu_{oi}$, for the
restricted universe of generalization in which both $o$ and $i$ are constants
for all observations. The value of Eq(6.1) does not indicate the
dependability of inferences from an observed score obtained with one value of
$i$ to universe scores involving different values of $i$. In particular, a
high value for $\mathcal{E}\rho^2(X_{oiR}, \mu_{oi})$ doesn't justify
inferences from $\mu_{oi}$ to $\mu_{oi'}$, the universe score for the same
value of $o$ and a different value of $i$, or to $\mu_o$, the expected value
of $\mu_{oi}$ over all values of $i$. Therefore, a high value for
$\mathcal{E}\rho^2(X_{oiR}, \mu_{oi})$ provides support for only a relatively
limited set of inferences.



The expected correlation between $\mu_{oi}$ and $\mu_{oi'}$, where $i$ and
$i'$ are independently sampled conditions for each observation, is given by:

$$\mathcal{E}\rho(\mu_{oi}, \mu_{oi'}) = \frac{\sigma^2(o)}{\sigma^2(o) + \sigma^2(oi) + \sigma^2(i)} \qquad (6.2)$$

Eq(6.2) is also equal to the expected squared correlation,
$\mathcal{E}\rho^2(\mu_{oi}, \mu_o)$, between the universe score, $\mu_{oi}$,
for universes that are restricted to a specific condition of the $i$ effect,
and the universe score, $\mu_o$, for a universe that includes the $i$ effect
as a facet.

If the universe scores, $\mu_{oi}$, do not vary much as a function of $i$,
then $\sigma^2(oi)$ and $\sigma^2(i)$ would be small, and the coefficient in
Eq(6.2) would be close to 1.0, indicating that $\mu_{oi}$ is a dependable
estimate of $\mu_{oi'}$. Therefore fixing the value of the $i$ facet isn't a
serious limitation if the observed scores don't vary much over the $i$ facet.
However, this invariance of $\mu_{oi}$ with respect to $i$ would also imply
that the validity coefficient in Eq(6.1) would not be substantially larger
than the validity coefficient for the original universe of generalization,
given by Eq(4.12), and there would be little advantage in restricting the
universe of generalization.






Eq(6.2) can also be derived by setting $n_i$ equal to one in Eq(4.12), and
taking the limit as $n_r$ approaches infinity.



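That is, Eq(6.2) provides an index of the dependability of inferences to
$\mu_o$ for observed scores based on a single condition of the $i$ facet and
an infinite number of replications. Furthermore, by comparing Eqs(4.12),
(6.1), and (6.2), it is clear that:

$$\mathcal{E}\rho^2(X_{oiR}, \mu_o) = \mathcal{E}\rho^2(X_{oiR}, \mu_{oi}) \cdot \mathcal{E}\rho^2(\mu_{oi}, \mu_o) \qquad (6.2a)$$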



Eq(6.2a) partitions the generalizability of inferences from X_oiR to u_o (or
to u_oi') into two parts. The first part, Er²(X_oiR, u_oi), represents
the dependability of inferences from X_oiR to u_oi, the mean over
replications for a fixed value of i. The second part, Er²(u_oi, u_o), is
the dependability of inferences from u_oi to u_o. For the investigator who
intends to generalize to the universe score, u_o, therefore, there is no
benefit in fixing the condition of the i facet. As a matter of fact, the
dependability of inferences to u_o would be improved by explicitly
recognizing the i effect as a facet, and increasing n_i.
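
The two-part partition described above can be checked numerically. What follows is a minimal sketch rather than the paper's own derivation: it assumes a random-effects model in which a score is the sum of independent object (o), condition (i), interaction (oi), and residual effects, with conditions sampled independently for each object, and it assumes the forms of the three coefficients that follow from that model. The variance-component values and the function names are hypothetical.

```python
# Sketch under assumed coefficient forms; all numeric values are hypothetical.

def coef_restricted(var_o, var_i, var_oi, var_res, n_r):
    """Assumed form of Er^2(X_oiR, u_oi): observed score to restricted universe score."""
    universe = var_o + var_i + var_oi  # variance of u_oi over object-condition pairs
    return universe / (universe + var_res / n_r)


def coef_between_universes(var_o, var_i, var_oi):
    """Assumed form of Er^2(u_oi, u_o): restricted to unrestricted universe score."""
    return var_o / (var_o + var_i + var_oi)


def coef_overall(var_o, var_i, var_oi, var_res, n_r):
    """Assumed form of Er^2(X_oiR, u_o) with a single condition of the i facet."""
    return var_o / (var_o + var_i + var_oi + var_res / n_r)


components = dict(var_o=4.0, var_i=1.0, var_oi=1.0, var_res=2.0)
n_r = 2

first = coef_restricted(n_r=n_r, **components)
second = coef_between_universes(
    components["var_o"], components["var_i"], components["var_oi"]
)
overall = coef_overall(n_r=n_r, **components)

# Under these assumptions the overall coefficient is the product of the two parts.
print(first, second, first * second, overall)
```

Because the second factor never exceeds one, fixing the condition of the i facet cannot raise the dependability of inferences to u_o under these assumptions.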

The main benefit derived from restricting the universe of generalization is
the increase in the validity of measurement, the dependability of inferences
from observed scores to universe scores. The main disadvantage associated
with restricting the universe of generalization is that it can lead to a very
large increase in the number of objects of measurement in the population. If
there were N_o objects in the original population, each of which can be paired
with N_i conditions of the i facet, there are N_o N_i objects in the new
population. Unless N_i is very small, therefore, the number of universe
scores that need to be estimated for a complete description of the population
may be greatly increased by restricting the universe of generalization.

INFERENCES THAT GO BEYOND THE SAMPLING MODEL

If the dependence of observations on the conditions of the i facet involved
in the observations is investigated, it may be possible to characterize the
conditions of the i facet by an attribute, w_i, such that:

    u_{oi} = f(v_o, w_i)    (6.3)

where f represents some function. Eq(6.3) expresses u_oi as a function of
two variables. The component, v_o, depends on the value of o but does not
depend on the value of i. The component, w_i, depends on the value of i but
does not depend on the value of o. The variables v_o and w_i may be previously
defined attributes or they may be defined as attributes by the law in
Eq(6.3). Since the basic reason for considering the attribute u_oi in place
of the attribute, u_o, is the lack of dependability in estimates of u_o, the
universe score, u_o, is not a very promising candidate for the variable, v_o;
if the estimates of v_o contain large errors of measurement, it may be
impossible to identify an appropriate functional form, f, for Eq(6.3).



It is often the case, therefore, that u_oi is expressed in a law with the
following form:

    u_{oi} = g(u_{oi*}, w_i)    (6.4)






where g represents some function, and i* is a particular condition of the
facet i. A new variable, u_oi*, is defined by restricting the universe of
generalization for all objects of measurement to the fixed reference
condition, i*, for the i effect. This new variable can be substituted for
v_o because it is a function of o, but not i. Measurements of u_oi and of
u_oi* are more dependable than measurements of u_o because they do not
assume invariance over the i effect; this facilitates the development of laws of
the form given by Eq(6.4).

If a law like that in Eq(6.4) can be developed, the limitation inherent in
measurements of u_oi can be overcome. With the help of Eq(6.4), measurements
of u_oi* provide information about all conditions of the i effect for which
w_i is known. This information is provided by the empirical law in Eq(6.4).
Since the use of Eq(6.4) requires measurement of the attribute w_i for all
values of i, this kind of inference from observations involving one condition
of the i effect to what would be expected for observations for another
condition of the i effect is more difficult to develop than inferences that use
only invariance properties. However, this more complicated approach provides
a detailed analysis of the relationship. Useful as they are,
invariance properties do not provide such an analysis.

The empirical law, given by Eq(2.2), relating length to temperature is an
instance of the kind of law indicated by Eq(6.4). This law can be used to
improve the dependability of the inferences involved in measurements of length.
A new quantity, l_rt, may be defined as an attribute of a rod at the
temperature, t:

    l_{rt} = L(rt)    (6.5)

The object of measurement for l_rt is a rod-temperature combination, rt,
instead of a rod. All observations that are used to estimate l_rt must
involve a specific rod and a specific temperature. The observations used
to estimate l_r must involve the rod defining the object of measurement, but
may involve a variety of temperatures. The attribute, l_rt, has a smaller
universe of generalization than the attribute l_r, and the direct
interpretation of measurements of l_rt is restricted to the temperature
specified in rt.

This restriction is effectively eliminated by the law of thermal expansion.
For a set of rods that are made of the same material and for a fairly wide
range of temperature, the coefficient of thermal expansion is a constant, k,
and Eq(2.2) can be written as

where t* is some fixed reference temperature. (For convenience, t* is often
taken to be 20°C, a comfortable value for room temperature.) Because
temperature variations introduce error into measurements of l_r (i.e.,
σ²(rt) is not zero), l_rt* can be measured more dependably than l_r. Also,
the temperature differences, (t-t*), can be measured very accurately, and
Eq(6.5) provides a very good fit to data over a wide range of temperatures.
Therefore, the dependability of estimates of l_rt is limited
mainly by the dependability in estimates of l_rt*. Therefore, fixing the
temperature for measurements of length does not seriously limit the
interpretation of these measurements.
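
The way an expansion law removes the restriction to a fixed temperature can be illustrated with a small calculation. This sketch assumes the familiar linear form of the law, l_rt = l_rt* [1 + k(t - t*)], which may differ in detail from the equation used in the paper. The coefficient value, the rod length, and the function names are hypothetical.

```python
# Minimal sketch assuming the linear law l_rt = l_rt* * (1 + k * (t - t*)).

T_REF = 20.0  # reference temperature t*, in degrees Celsius


def length_at(l_ref, k, t, t_ref=T_REF):
    """Predict the length at temperature t from the length at the reference temperature."""
    return l_ref * (1.0 + k * (t - t_ref))


def length_at_reference(l_t, k, t, t_ref=T_REF):
    """Convert a length observed at temperature t back to the reference temperature."""
    return l_t / (1.0 + k * (t - t_ref))


k = 1.2e-5      # hypothetical expansion coefficient, per degree Celsius
l_ref = 1000.0  # hypothetical rod length at t*, in millimeters

l_40 = length_at(l_ref, k, 40.0)
print(l_40)                                 # predicted length at 40 degrees
print(length_at_reference(l_40, k, 40.0))   # recovers the reference length
```

A single dependable measurement at the reference temperature, combined with k and the temperature difference, yields a value for any other temperature in the range where the law holds.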

The observations involved in measurement are of interest mainly because
they support inferences to other observations. These inferences are of two
kinds. First, there is an inference from the observation to the universe
score, the mean over the universe of generalization. Second, there are
inferences from one universe score to other universe scores.



Using the bridge analogy of Cornfield and Tukey (1955, p. 912), these two
inferences can be represented by the two spans of a bridge that crosses a
river. The first span represents inferences from the observed score to a
universe score, and the second span represents inferences from the universe
score to the universe scores for other attributes.

If a well-articulated theory is available connecting an attribute to other
attributes, the second span, which is supported by empirical laws, may be made
quite long without weakening the total inference. Laws of the kind indicated
by Eq(6.4) make it profitable to shorten the length of the first span by
narrowing the universe of generalization. Inferences from observed scores to
the universe score, u_oi*, for the restricted universe of generalization have
a higher validity than inferences to u_o. Therefore, restricting the
universe strengthens the first span. A well-confirmed law of the type given
in Eq(6.4) provides a strong second span by justifying inferences from u_oi*
to the other universe scores u_oi. Therefore restricting the universe of
generalization does not result in any loss in generality; the second span is
simply bearing a larger share of responsibility for the inferences based on
observed scores. "A proposal to sample items from a broad domain at random is
generally but not always a sign that one's understanding is crude" (Cronbach
et al., 1972).

Note that an invariance property is a special case of the class of laws
indicated by Eq(6.4). In particular, if the function, g, is such that u_oi
is a constant for all conditions of the i facet (i.e., w_i is a constant),
then u_oi is invariant with respect to the i facet. In such cases, there is
no loss involved in taking o, instead of oi, as the object of measurement, and
there is some gain in simplicity. (In practice, it is often convenient to
assume that u_oi is invariant with respect to the i facet, even where this
assumption is known not to hold exactly.)

A NOTE ON LATENT TRAIT MODELS

In the behavioral sciences, when a test consisting of some set of items is
administered to a person, the observed score is usually interpreted as an
estimate of the universe score for a disposition, with generalization over the
item facet. Latent trait theory can be considered a special case of the kind
of model being discussed here. For example, the Rasch model (Wright and
Douglas, 1977) represents the probability, x_pi, that person, p, answers
item, i, correctly in terms of an attribute, v_p, of the person and an
attribute, w_i, of the item:

    x_{pi} = \frac{e^{(v_p - w_i)}}{e^{(v_p - w_i)} + 1}

The ability parameter is assumed to be an attribute of the person, and may
vary from one person to another. However it is explicitly assumed that v_p
does not vary with the sample of items used to estimate v_p. Furthermore it
is at least implicitly assumed that v_p does not change as other conditions
of observation vary. Similarly, items are the objects of measurement for the
difficulty attribute, w_i, and the value of w_i is assumed to be independent
of the sample of persons used to estimate w_i, and of the conditions of
observation that may hold when w_i is estimated.

Latent trait theories do not generally incorporate an explicit theory of
errors. The kind of inconsistencies that can be attributed to errors of
measurement are interpreted in terms of a lack-of-fit for latent trait models.
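
The Rasch model described in this note expresses the probability of a correct answer as e^(v_p - w_i) / (e^(v_p - w_i) + 1). The sketch below simply evaluates that expression; the ability and difficulty values and the function name are hypothetical and serve only to show that the probability depends on the person and the item through the difference v_p - w_i.

```python
import math


def rasch_probability(v_p, w_i):
    """Probability that person p answers item i correctly under the Rasch model."""
    z = math.exp(v_p - w_i)
    return z / (z + 1.0)


abilities = [-1.0, 0.0, 1.0]      # hypothetical person attributes v_p
difficulties = [-0.5, 0.0, 0.5]   # hypothetical item attributes w_i

# One row per person, one column per item.
for v_p in abilities:
    print([round(rasch_probability(v_p, w_i), 3) for w_i in difficulties])
```

When ability equals difficulty the probability is one half, and it rises with ability and falls with difficulty.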



NOMOLOGICAL NETWORKS AND THE IMPORT OF MEASUREMENTS

The development of inferences beyond the universe of generalization for an
attribute, as exemplified by the use of Eq(6.4), raises the question of the
role of empirical laws in evaluating measurement procedures.

In discussing this issue, it is useful to define a third property of
measurement, in addition to reliability and validity. This third property,
the theoretical import, or the import, of measurement, can be defined as the
total significance of what can be inferred from the measurement (Hempel, 1952).
Import emphasizes the scope or range of inferences that can be drawn from a
measurement as well as the accuracy of individual inferences. As defined
here, import does not involve a numerical index, and it is not assumed that
the import of individual inferences can be measured on any scale of utilities.

Hempel (1952, p. 46) provides a good example of the distinction between
"theoretical import" and "empirical import" (or validity):

     Concepts with empirical import can be readily defined in any
     number, but most of them will be of no use for systematic
     purposes. Thus, we might define the hage of a person as the
     product of his height in millimeters and his age in years. This
     definition is operationally adequate and the term "hage" thus
     introduced would have relatively high precision and uniformity
     of usage; but it lacks theoretical import, for we have no
     general laws connecting the hage of a person with other
     characteristics.

Although "hage" lacks import, it is possible to measure "hage" with a high
degree of reliability and validity.

The invariance properties, which provide the basic justification for
interpreting observations as measurements, provide a core of import to all
measurements. These invariance properties justify inferences from the
observed scores to the universe score, and also, to some extent, to all other
observed scores in the universe of generalization. If the attribute isn't
involved in any other empirical laws, these inferences to the universe of
generalization define the total import of the measurement.

The contribution of the invariance laws to import depends on the
generality of these laws. For example, the import provided by invariance
properties for the measurement of u_oi may be relatively minor, since
inferences from X_oiR to scores for other values of i are not justified by
these invariance properties. If attention is restricted to the invariance
properties implied by the definitions of their respective universes of
generalization, the attribute, u_o, has greater import than the attribute,
u_oi. As such, measurements of u_oi do not justify inferences to other
conditions of the i facet. Measurement of u_o, on the other hand, involves
generalization over the i facet.

Measurable attributes that play a central role in the fundamental theories
of a science are seen as having greater significance, or import, than
attributes which are involved in one or two isolated empirical laws, or in no
laws at all. The extended network of laws, which specifies the empirical
content of the theory and also provides confirmation for the theory, can
greatly extend the implications of measurements.

In practice, the development of empirical laws can lead to simultaneous
increases in both the validity and the import of measurements. This is done
by partitioning the universe of generalization into a number of more narrowly
defined subuniverses, while connecting the universe scores for these
subuniverses through empirical laws. Physics has used this strategy very
effectively. The history of the measurement of such basic measurable
attributes as length reveals the gradual refinement of their universes of
generalization. In the course of this development many facets have been
standardized, including the physical standard of length, the
temperature, and numerous aspects of procedure. At the same time, the import
of length measurements has been increased by the use of nomological networks,
including Euclidean and non-Euclidean geometries, classical mechanics, and the
theory of relativity. Using these theories, inferences can be drawn, at one
extreme, about the distances between galaxies, and, at the other extreme,
about the sizes of subatomic particles, while defining length in terms of
operations involving a particular platinum-iridium bar in Paris.

NOMOLOGICAL NETWORKS AND VALIDITY

It is clear therefore that the existence of empirical laws, and more
importantly, nomological networks, may greatly extend the range of inferences
that can be drawn from measurements. Therefore such networks can greatly
increase the import of measurements. But returning to the question posed
earlier, what are the implications of such networks of laws for the validity
of measurement? In particular, what are the implications of an empirical law,
like Eq(6.5), for the validity of measurement?

The definition of dispositional validity states that a measurement
procedure is valid to the extent that its observed scores provide dependable
estimates of universe scores. If this definition of validity is accepted, the
existence of empirical laws relating measurements of an attribute to
measurements of other attributes has no direct bearing on the validity of
measurements of the attribute. Campbell (1921, pp. 109-134) distinguishes
clearly between the development of an acceptable measurement procedure (MP) and the
application of this MP in the discovery of empirical laws. Campbell (1921,
p. 134) recognizes that it is "because true measurement is essential to the
discovery of laws that it is of such vital importance to science", but he does
not use these laws to justify particular MPs. Measurement procedures are
justified by a careful examination of the operations involved in the MP and by
experimental verification of invariance properties.

The dependability of estimates of universe scores depends on the
definition of the universe of generalization, the magnitude of variance
components, and the extent of sampling of various facets for each observed
score. For example, assuming that σ²(oi) is greater than zero,
generalization from X_oiR to u_oi will be more dependable than
generalization from the same observed score, X_oiR, to the more broadly
defined universe score, u_o. However the dependability of estimates of u_o
can be improved simply by sampling the i facet more thoroughly. In general,
measurements of u_o can always be made as valid as measurements of u_oi by
increasing sample sizes.
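
The effect of sampling the i facet more thoroughly can be shown with a short calculation. The sketch below is illustrative only: it assumes that each observed score averages over n_i independently sampled conditions with n_r replications in each, so that the error variance for inferences to u_o is [σ²(i) + σ²(oi)]/n_i + σ²(res)/(n_i n_r). The variance components, the residual term, and the function names are hypothetical.

```python
# Sketch under an assumed error structure; all numeric values are hypothetical.

def coef_broad(var_o, var_i, var_oi, var_res, n_i, n_r):
    """Assumed dependability of inferences to u_o with n_i sampled conditions."""
    error = (var_i + var_oi) / n_i + var_res / (n_i * n_r)
    return var_o / (var_o + error)


def coef_narrow(var_o, var_i, var_oi, var_res, n_r):
    """Assumed dependability of inferences to u_oi for one fixed condition."""
    universe = var_o + var_i + var_oi
    return universe / (universe + var_res / n_r)


var_o, var_i, var_oi, var_res, n_r = 4.0, 1.0, 1.0, 2.0, 2

narrow = coef_narrow(var_o, var_i, var_oi, var_res, n_r)
broad = {n_i: coef_broad(var_o, var_i, var_oi, var_res, n_i, n_r)
         for n_i in (1, 2, 4, 8)}

print("restricted universe:", round(narrow, 3))
for n_i, value in broad.items():
    print("n_i =", n_i, "->", round(value, 3))
```

With these hypothetical components, the coefficient for the broader universe starts below the restricted coefficient at n_i = 1 and overtakes it once enough conditions are sampled.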

All derived attributes and most basic attributes are involved in empirical
laws, and these laws may be tied to the body of laws and theoretical
constructs associated with a theory. For example, the law of thermal
expansion, Eq(6.5), describes phenomena that a theory would be expected
to explain, and at least a partial explanation of these phenomena can be
provided in terms of the motion of molecules. However, the existence of such
an explanation is not necessary in order to interpret a coefficient of thermal
expansion. With or without a theory, a coefficient of thermal expansion is
interpreted as a measure of the degree to which a rod will expand when heated
and contract when cooled. A model for the molecular structure of solids may
provide insight into why this phenomenon occurs, but such models are not
necessary for an understanding of what the phenomenon is.






However, networks of empirical laws do have a great, but indirect,
influence on the validity of measurements. The laws in the network make it
feasible to restrict the universe of generalization for attributes and
therefore to increase the validity of measurements without decreasing their
import. These more narrowly defined attributes depend on the network for
their import, rather than on the invariance properties, and the magnitude of
the errors of measurement is reduced because assumptions are made for fewer
invariance properties.

THE TRADEOFF BETWEEN VALIDITY AND IMPORT

If the issue of import is ignored, it is easy to develop attributes that
can be measured with a high degree of validity. It is only necessary to define
the universe of generalization very narrowly. For narrowly defined universes,
the inferences from observations to the universe involve generalization over
relatively few facets, and in such cases estimates of the universe scores are
likely to be very dependable. In the extreme case, where observations are
interpreted simply as observations, there is no inference, and estimates of
the universe scores are perfectly accurate.

However, researchers cannot ignore the issue of import, and decisions
which involve tradeoffs between the validity of measurements and the import of
measurements must be made. The researcher who interprets observations
narrowly will draw more accurate inferences than the researcher who interprets
observations broadly, but the inferences of the first researcher say less
about the world than the inferences of the second researcher. The choice
between narrow but dependable interpretations and broader but less dependable
interpretations is a choice of strategy. The continuum of available
strategies is more-or-less anchored by strict operationism (Bechtoldt, 1959) at
one end and by construct validity (Cronbach and Meehl, 1955) at the other end.

Strict operationism demands that attributes be defined narrowly enough to
insure that the validity of the interpretations is essentially perfect. The
strict operationist is unwilling to give any hostages to the future in the
form of invariance properties that might turn out to be only approximations.
If taken seriously, this position implies that observations should not be
assumed to be generalizable to any wider universe of generalization, but
should instead be interpreted simply as observations. All of the implications
of the observation are to be derived by developing empirical laws that state
relationships between observations. Strict operationism is the strategy of
pure empiricism, and theory plays essentially no role (Bechtoldt, 1959).

Construct validity, in its most general form, would define an attribute in
terms of all of the relationships in which the attribute appears. From the
standpoint of construct validity, the definition of an attribute entails
certain laws, and, in order for a MP to be valid, its observed scores must
satisfy these laws. These assumptions are the postulates of a theory, and may
state relationships between the construct and other constructs, in addition to
some set of invariance properties.

CONSTRUCT VALIDITY

It was stated earlier that reliability is a property of a measurement
procedure, while validity is a property of a measurement procedure and the
attribute being measured. Construct validity is defined as a property of a
measurement procedure, a construct, and the network defining the construct.

     "Acceptance," which was critical in criterion-oriented and
     content validities, has now appeared in construct validity.
     Unless substantially the same nomological net is accepted by the
     several users of the construct, public validation is impossible.
     ... A consumer of the test who rejects the author's theory
     cannot accept the author's validation. He must validate the
     test for himself, if he wishes to show that it represents the
     construct as he defines it.

Therefore, in construct validity it is not accurate to say
that a measurement procedure is valid for a construct. Instead, a claim for
construct validity should specify a measurement procedure, a construct, and
the theory defining the construct.

If constructs are defined in terms of a network, changing the network of
laws in which the construct is embedded implies a change in the definition of
the construct. Therefore, evidence for the construct validity of a
measurement procedure may not apply in different networks. The invariance
properties that are tested in validating a dispositional attribute form a
subset of laws that may be part of many theories. Evidence for dispositional
validity of a measurement procedure applies in all networks that involve the
same definition of the attribute and therefore the same set of invariance
properties.

As stated earlier, all of the procedures included in dispositional
validity are consistent with construct validity. Indeed, the procedures
suggested here form a subset of those proposed by Cronbach and Meehl (1955).
The difference between the two approaches is in the fact that, within
construct validity, the testing of any law involving a construct could be
construed as a test of the validity of a measurement procedure for the
construct. Dispositional validity restricts its attention to invariance
properties; the testing of laws other than the invariance properties is seen
as being directly relevant to an attribute's import but not to its validity.

None of the many types of research included within construct validity is
excluded by the definitions in this paper. The major change being proposed is
in how some studies will be interpreted, rather than in the kinds of studies
to be done. In particular, the whole spectrum of research that could be
interpreted in terms of the import of an attribute would, within construct
validity, be interpreted in terms of the validity of a measurement procedure.

Therefore, the apparent loss involved in giving up evidence from some
parts of a nomological network is not really a loss at all. For dispositions,
the inferences that are subsumed under construct validity are divided into two
categories, validity and import. None of the studies encouraged by construct
validity is thrown away in considering dispositions; the results of some
of these studies are interpreted as evidence for the import, rather than the
validity, of the disposition. Although the evidence that can be used to
support dispositional validity is more restricted than the evidence permitted
by construct validity, the inferences that are drawn by dispositional validity
are more restricted than those drawn by construct validity.

Construct validity is a property of a measurement procedure in relation to a
network because a measurement procedure is said to have construct validity by
virtue of its inclusion in a validated network. For dispositional validity,
import is a property of the network as a whole, and validity is a property of
a measurement procedure in relation to the universe of generalization defining
a disposition.



DISCRIMINANT VALIDITY

Discriminant validity can be examined by estimating variance components.
A large variance component for the interaction between attributes and objects
of measurement would indicate that the intercorrelations among
attributes are small. This implies the universe score for one attribute
doesn't provide a dependable estimate of the universe scores for other
attributes, and therefore that each of the attributes being measured provides
information which is independent of the information in the other attributes.
However, as applied to dispositional attributes, discriminant validity is more
closely associated with import than with validity.

The logic of discriminant validity depends on the existence of, at least,
a rudimentary theory. If two attributes, A1 and A2, represent
hypothesized constructs that are assumed by some theory to be unrelated, it
would be expected that measurements of these attributes should also be
unrelated. By the logic of construct validity, a strong relation between
measurements of A1 and A2 could be interpreted as evidence that at least
one of these two sets of measurements is invalid.

This kind of logic is not generally appropriate for dispositional
attributes, because dispositions are defined in terms of universes of
observations, and do not depend on theoretical networks for their meaning.
Assuming that A1 and A2 are clearly defined dispositions, a strong
relationship between measurements of these two attributes would be interpreted
as an empirical law that theory would be expected to explain. In particular,
a high correlation between measurements of A1 and A2 would be evidence
against any theory which treated these two dispositions as being unrelated.
It would not be interpreted as evidence for a lack of validity in the
measurement procedures.

The exact interpretation of the relationship between A1 and A2 would,
of course, depend on how the two attributes are defined. If the two universes
had a high degree of overlap in their observations, it would not be surprising
that the means over these two universes are related. However, even if there
were no overlap in their universes and no other reason to expect a
relationship, the two attributes could be strongly related without having the
validity of their measurement procedures denied.

For example, the empirical result that there is a strong correlation
between thermal conductivity and electrical conductivity is not taken as
evidence against the validity of the procedures used to measure these two
dispositional attributes. Instead, the relationship between these two
attributes is a legitimate finding, which a theory of solids would
be expected to accommodate.

It is generally true, however, that a very high correlation between
different attributes may limit their usefulness for various purposes. If two
attributes were perfectly correlated, the measurement of either attribute
could serve all of the purposes served by measurements of both attributes.
Therefore, the import of a measurement procedure depends on its having
discriminant validity.






VII An Overview of Sampling Models for Validity 



The earlier sections of this paper have discussed a sampling model for
dispositional attributes in some detail. This section provides a general
summary of this sampling model, and briefly discusses some issues, including
objections to sampling models, not covered in the previous sections.

A GENERAL SUMMARY OF THE SAMPLING MODEL

The sampling model for the validity of measurements of dispositional
attributes is based on generalizability theory. The "true" value of a
dispositional attribute is the universe score, defined as the mean over the
universe of generalization. A measurement procedure is valid for a
dispositional attribute to the extent that the observed scores generated by
the procedure are dependable estimates of the universe score.

The basic premise of sampling models is that measurement involves
inferences from observed scores, which are based on samples of observations,
to the mean over the universe from which these samples are drawn; for these
inferences to be justified, a set of invariance properties must hold, at least
approximately. The invariance properties must be verified empirically.

Dispositional validity could be estimated directly by using samples of
observations, randomly sampled from the universe of generalization, to estimate
a generalizability coefficient, which reflects the dependability of inferences
from observed scores to universe scores. Although this direct approach has
the virtue of simplicity, it is not practical in most cases.
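
For the cases where the direct approach is feasible, the estimation can be sketched as follows. This is a minimal illustration, not the paper's procedure: it assumes a fully crossed design with one observation for each object-condition pair, uses the standard expected-mean-square estimators for a two-way random-effects layout, and uses a relative error variance of σ²(res)/n_i. The data matrix and the function names are hypothetical.

```python
# Sketch for an assumed crossed objects x conditions design, one score per cell.

def variance_components(scores):
    """Estimate (var_o, var_i, var_res) from scores[o][i] via mean squares."""
    n_o, n_i = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_o * n_i)
    row_means = [sum(row) / n_i for row in scores]
    col_means = [sum(scores[o][i] for o in range(n_o)) / n_o for i in range(n_i)]

    ms_o = n_i * sum((m - grand) ** 2 for m in row_means) / (n_o - 1)
    ms_i = n_o * sum((m - grand) ** 2 for m in col_means) / (n_i - 1)
    ss_res = sum(
        (scores[o][i] - row_means[o] - col_means[i] + grand) ** 2
        for o in range(n_o)
        for i in range(n_i)
    )
    ms_res = ss_res / ((n_o - 1) * (n_i - 1))

    # Negative estimates are truncated at zero, a common convention.
    var_o = max((ms_o - ms_res) / n_i, 0.0)
    var_i = max((ms_i - ms_res) / n_o, 0.0)
    return var_o, var_i, ms_res


def generalizability_coefficient(var_o, var_res, n_i):
    """Dependability of a mean over n_i conditions as an estimate of the universe score."""
    return var_o / (var_o + var_res / n_i)


scores = [[2, 4], [4, 6], [9, 5]]  # hypothetical data: 3 objects by 2 conditions
var_o, var_i, var_res = variance_components(scores)
print(var_o, var_i, var_res)
print(generalizability_coefficient(var_o, var_res, n_i=2))
```

Even this small example shows why the approach is demanding in practice: stable variance-component estimates require far more objects and conditions than a toy matrix provides.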

The sampling model is based on a small number of assumptions. The sampling
model makes no assumptions about underlying constructs; it neither affirms nor
denies the existence of such theoretical constructs. The model makes no
assumptions about the distribution of observed scores, the distribution of
universe scores, or the relationships between different kinds of observed
scores. No restrictions are put on the universe of generalization or the
universe of allowable observations except that the universe of allowable
observations is, by definition, included within the universe of generalization;
in particular, the model doesn't dictate what kinds of conditions can be
defined as facets. The one assumption that is necessary for the sampling model
is the random sampling assumption.

OBJECTIONS TO THE SAMPLING MODEL

The simple version of the sampling model assumes that an attribute is
defined in terms of a universe of generalization and that measurements are
based on random samples from this universe. This simple version of the
sampling model ignores the complex sampling designs that are actually employed
by measurement procedures. In practice, the assumption that observed scores
are based on random samples from the universe of generalization does not apply
whenever any facet is explicitly or implicitly standardized.

A number of authors (Loevinger, 19[illegible]; Rozeboom, 19[illegible]) have
objected to the simple version of the sampling model because behavioral
measurements do not generally consist of random samples from a clearly defined
universe of generalization. Ambiguity in the definition of the universe of
generalization is not unique to the social sciences. The discussion of
attributes in an earlier section of this report made a point of emphasizing
many of the sampling problems associated with measurement in the physical
sciences. For most attributes, the boundaries of the universe of
generalization tend to be quite fuzzy, and the sampling of conditions of
various facets is far from random. It is generally impossible to select
random samples from the universe of generalization, in part because of
vagueness in the definition of the universe of generalization, and in part
because of practical difficulties.




THE VALIDITY OF MEASUREMENT: SAMPLING FROM THE UNIVERSE OF GENERALIZATION

All of these objections emphasize the problems inherent in trying to take
random samples from the universe of generalization. In order for statistical
inferences from observed scores to universe scores to be unbiased, the sample
of observations must be selected randomly from the universe of generalization,
and the simple version of the sampling model assumes that observations are
randomly sampled from the universe of generalization. To the extent that the
random sampling assumption is untenable, the simple version of the sampling
model is untenable. Since it is usually not possible to sample the universe
of generalization randomly, this simple model is seldom applicable.

The sampling model developed in this paper recognizes that the estimation
of a universe score may involve a relatively complicated set of inferences.
Within this more realistic model, inferences from observed scores to universe
scores may be analyzed into several steps and may require the confirmation of
an extensive network of laws for their justification. In particular, the
reliability and validity of a measurement procedure involve a large number
of invariance properties. The investigation of import would add other
empirical laws to the network.

In light of the substantial difficulties in drawing random samples from
the universe of generalization (not to mention the demands on sample sizes for
the estimation of variance components, when random sampling is possible), how
is this network to be verified? In the physical sciences, the verification of
the required set of empirical laws is not generally accomplished in a single
study. It is even less likely that behavioral measurements can be
evaluated in a single study, in which observations are randomly sampled from
the universe of generalization.

Even a cursory review of the issues involved in the confirmation of laws
would go far beyond the scope of this paper. However, a few general remarks on
this topic may put the problems involved in testing the invariance properties
into perspective.

Karl Popper (1965, 1968) has suggested an approach to the verification of
laws which accurately reflects the practice of science and has been widely
accepted. Popper views laws as conjectures which can be tested in various
ways, but which can never be definitely confirmed. A general law can be
applied to a large number of observations, and any of these observations can
be used as tests of the law. A deterministic law that fails a single test or
a statistical law that fails a large proportion of its tests is refuted. A
law which is subjected to a large number of tests of various kinds without
being refuted is considered to be supported by these tests.

There is a clear lack of symmetry in the treatment of laws as conjectures
that are subject to refutation; a law can be refuted in a single study, but,
in general, even a large number of studies cannot definitely confirm the law.
The more tests of various kinds that a law has been exposed to without being
refuted, the more strongly it is considered to be supported, but the law is
never completely confirmed. In a sense, therefore, Popper replaces the concept
of the confirmation of a law by the concept of the degree of confidence in the
law. Each successful empirical test of the law increases confidence in the
law, and the failure of one test can cause the law to be refuted.

Toulmin (1953) describes physical laws not as inductive generalizations but
as rules of inference, which can be used to draw conclusions from observed
facts. The question to be asked about such rules of inference is not whether
they are true or not, but how widely they apply. Toulmin analyzes the role
of laws somewhat differently from Popper, but, for the purposes of this paper,
these two views are complementary. The presumptions that are made about the
class of observations to which a law applies are tested every time the law is
applied, and therefore such presumptions are conjectures which are subject to
refutation.

Both of these analyses of the methodology for the confirmation of
scientific laws have implications for the validation of measurement
procedures. The definition of an attribute involves a universe of
generalization. The claim that a measurement procedure generates valid
measurements of the attribute is equivalent to the conjecture that the
observed scores are invariant with respect to random sampling from the
universe of generalization. This conjecture can be decomposed into a number
of more specific invariance properties, each of which applies to a specific
facet.

A successful test of any one of the invariance properties, which is based
on a random sample of conditions from a facet, provides strong support for the
specific invariance property, and somewhat weaker support for the cluster of
invariance properties associated with the attribute. As the number of facets
that is investigated without encountering refutations of their invariance
properties increases, the degree of support for the conjecture that the
measurement procedure adequately represents the attribute increases.

G studies which do not sample randomly from a facet, but instead sample
from a restricted subset of the universe for the facet, don't provide adequate
evidence for invariance over the full facet. They do provide evidence for
invariance over the subuniverse sampled, they provide some support for
invariance over the facet, and they also provide some support for invariance
over the universe of generalization as a whole. Such a G study provides only
weak evidence for the validity of a measurement procedure, and it may require
a large number of such G studies to develop a high degree of confidence in the
interpretation of observed scores as valid measurements of an attribute.


The evaluation of evidence for dispositional validity is complicated
further by the fact that the support for invariance over the universe of
generalization that is provided by evidence for invariance over a particular
facet will depend on the facet studied. If there is some reason to suspect
that a particular facet may have a large effect on observed scores, evidence
for invariance over that facet provides relatively strong support for
invariance over the universe of generalization. For example, in evaluating
the dependability of performance ratings, the variance components for raters
could be substantial, while the variance component for equipment would be
negligible in many cases. Therefore, an investigation of the rater facet
provides a more severe test than an investigation of the equipment facet, and
passing the more severe test provides stronger evidence for the overall
dependability of the measurement procedure than passing the weaker test.
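
The contrast between a severe and a weak test can be illustrated with a
minimal sketch. It assumes two separate, fully crossed G studies (persons by
raters, persons by equipment) with one observation per cell, and it uses the
usual expected-mean-square estimators for that design. The function name and
the data are hypothetical and serve only as an illustration; with one
observation per cell the interaction is confounded with residual error.

```python
def variance_components(scores):
    """Estimate variance components for a crossed persons x conditions design
    with one observation per cell (expected-mean-square estimators).

    scores[p][c] is the observed score of person p under condition c.
    Returns (var_person, var_condition, var_residual).
    """
    n_p = len(scores)
    n_c = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_c)
    person_means = [sum(row) / n_c for row in scores]
    cond_means = [sum(scores[p][c] for p in range(n_p)) / n_p
                  for c in range(n_c)]

    # Mean squares for persons, conditions, and the residual (interaction + error).
    ms_person = n_c * sum((m - grand) ** 2 for m in person_means) / (n_p - 1)
    ms_cond = n_p * sum((m - grand) ** 2 for m in cond_means) / (n_c - 1)
    ss_res = sum(
        (scores[p][c] - person_means[p] - cond_means[c] + grand) ** 2
        for p in range(n_p)
        for c in range(n_c)
    )
    ms_res = ss_res / ((n_p - 1) * (n_c - 1))

    # Negative estimates are conventionally set to zero.
    var_person = max((ms_person - ms_res) / n_c, 0.0)
    var_cond = max((ms_cond - ms_res) / n_p, 0.0)
    return var_person, var_cond, ms_res


# Hypothetical data: 3 persons scored by 2 raters, and the same 3 persons
# scored with 2 pieces of equipment.
by_rater = [[0.0, 3.0], [2.0, 5.0], [4.0, 7.0]]
by_equipment = [[0.0, 0.5], [2.0, 2.5], [4.0, 4.5]]

_, var_rater, _ = variance_components(by_rater)
_, var_equipment, _ = variance_components(by_equipment)

print("rater component:", var_rater)
print("equipment component:", var_equipment)
```

In these made-up data the rater component is much larger than the equipment
component, so a finding of invariance over raters would have been the more
severe test of the procedure.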

RANDOM SAMPLING FROM THE UNIVERSE OF ALLOWABLE OBSERVATIONS

The random sampling of observations is important for the measurement of
dispositional attributes in two ways. First, a measurement procedure is
defined in terms of random samples from the universe of allowable observations
(D studies). Therefore, the application of the measurement procedure to any
object of measurement requires the random sampling of observations from the
universe of allowable observations for the object of measurement. Second,
studies of the properties of measurement procedures (G studies) require random
sampling from the population of objects of measurement and random sampling
from the universe of generalization.

The purpose of G studies is to provide data that can be used in the design
of effective measurement procedures. In particular, an important goal in
designing a measurement procedure is to reduce the number of facets that must
be randomly sampled in obtaining an observed score, and there are three ways to
do this. First, if G studies show that all of the variance components (for the
main effect and interactions) for a facet are zero, there is no need to be
concerned about how this facet is sampled. Second, facets that have been
standardized are not sampled in estimating universe scores, but the variance of
the systematic errors for these facets must be estimated in G studies. Third,
if the universe of generalization is restricted in connection with the
development of theory, thus defining a new attribute, the universe of allowable
observations for measurements of this new attribute will also be restricted;
that is, if the new attribute does not involve generalization over a facet, the
measurement procedure will not involve sampling of the facet. Such theoretical
developments are also based, in part, on the results of G studies.

All of these modifications of the measurement procedure tend to decrease
the number of facets which must be randomly sampled in obtaining an observed
score. The only facets that need to be sampled randomly are those for which
interactions with the objects of measurement are fairly large and apparently
random. Efforts to obtain random samples can be concentrated on these facets,
and as the number of such facets decreases, the difficulty in taking random
samples from the universe of allowable observations is reduced.

One of the results of these changes is, therefore, to transfer the burden
of random sampling from the measurement procedure to the G studies. If these
efforts to reduce or eliminate the random errors for various facets are
successful, the variance components of all the facets in the universe of
allowable observations will be small, and therefore the measurement
procedure will have a high reliability.

To the extent that the universe of allowable observations is homogeneous in
the sense that all observations in this universe yield approximately the same
value of the observed score for each object of measurement, it does not matter
which observation is selected from the universe of allowable observations, or
how this observation is chosen. This is especially true when the random error
variance is small compared to the variance of specific systematic errors. To
the extent that this kind of homogeneity is attained in the universe of
allowable observations, the measurement procedure is robust against violations
of the random sampling assumption for the facets in the universe of allowable
observations.

This assertion may seem extraordinary since statistical conclusions are
never robust against violations of their sampling assumptions, but the
situation being described is not one that is typically encountered in
statistics. In most statistical analyses, the population is fixed and the aim
of the study is to estimate some parameter for the population. In order to
obtain unbiased estimates of the parameter, it is necessary to sample randomly.

In developing a measurement procedure, the situation is quite different.
Here, the universe of allowable observations from which observations are to be
drawn is not fixed. The goal is to make the variance in observed scores for
each object of measurement as small as possible by refining the definition of
the universe of allowable observations. To the extent that this goal is
achieved, all observed scores in the universe of allowable observations for
each object of measurement will be approximately the same, and it will not
matter which observation is chosen.

For example, in measuring length, it isn't necessary to randomly sample
from the universe of meter sticks, because it is known that all meter sticks
give essentially the same result. Therefore, the most convenient meter stick
is used. The justification for such practices is found in the empirical
generalization that the variance introduced into observations by the choice of
meter stick is very small compared to the variance introduced by some other
factors (e.g., temperature).



GENERALIZABILITY COEFFICIENTS AS UPPER BOUNDS ON VALIDITY

The main source of difficulty with the simple version of the sampling model
is that it ignores the complex sampling procedures that are actually employed
by measurement procedures. In order to obtain a single point estimate of the
validity, this simple model makes the unrealistic assumption that observed
scores are based on random samples from the universe of generalization. In
practice, this assumption does not apply whenever any facet is standardized.
The explicit recognition of standardization leads to a more complex model that
does not claim to provide point estimates of validity, but aims instead to
provide a series of upper bounds on the validity.

Standardization changes random errors into systematic errors. The
evaluation of systematic errors due to standardization requires that the
sampling variability for the standardized facet be estimated independently of
the other facets, and there may be many facets that are standardized in a given
measurement procedure. In general, it is not practical to draw random samples
from the universe of generalization, and therefore all the invariance properties
for an attribute cannot be evaluated in a single G study. In particular, it is
usually not possible to obtain independent estimates of variance components for
more than a few facets without having very large sample sizes.

A series of upper bounds is perhaps a less satisfactory result than a point
estimate of the validity, but it is generally more realistic to consider the
coefficient resulting from a G study as an upper bound on validity than
as an unbiased point estimate.
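
The upper-bound relationship can be made concrete with a small numerical
sketch. It assumes one randomly sampled facet and one standardized facet, and
it assumes that the person-by-standardized-facet variance is confounded with
universe-score variance in the G study but counts as systematic error against
the full universe of generalization. The function names, argument names, and
numbers are illustrative assumptions, not notation taken from this paper.

```python
def g_study_coefficient(var_p, var_p_std, var_p_rand, n_rand):
    """Coefficient obtained when one facet is held fixed (standardized).

    The person-by-standardized-facet variance (var_p_std) cannot be separated
    from the person variance (var_p), so it inflates the numerator.
    """
    observable = var_p + var_p_std
    return observable / (observable + var_p_rand / n_rand)


def validity_coefficient(var_p, var_p_std, var_p_rand, n_rand):
    """Same design judged against the full universe of generalization.

    Here var_p_std is treated as systematic error rather than as part of
    the universe-score variance.
    """
    return var_p / (var_p + var_p_std + var_p_rand / n_rand)


# Hypothetical variance components.
components = dict(var_p=4.0, var_p_std=1.0, var_p_rand=2.0, n_rand=2)

bound = g_study_coefficient(**components)
validity = validity_coefficient(**components)

print("coefficient from the G study:", round(bound, 3))
print("validity under the assumptions above:", round(validity, 3))
```

Both coefficients share the same denominator, and the numerators differ only
by the nonnegative term var_p_std, so under these assumptions the coefficient
from the G study can never fall below the validity; the two coincide only when
the standardized facet contributes no variance.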

THE PROCRUSTES EFFECT IN DEFINING UGs

Throughout most of this paper, it has been tacitly assumed that the
universe of generalization defining an attribute is fixed, and that the task is
to investigate the invariance properties implied by the attribute's definition.
For purposes of exposition, these assumptions have been convenient, but, in
practice, the situation is never quite this simple (see Cronbach, 1971, p. 482).

The definition of the universe depends, at least in part, on the invariance
properties that can be established. Initially, the definition of the universe
is likely to be very loose, with many facets defined vaguely by standard
expressions such as "within normal limits". Over time, the universes of
conditions for various facets may be clarified, as more is learned about how
various conditions of the facet influence observations. For example, if it is
found that observations depend strongly on the choice of condition for a
particular facet, it may be necessary to restrict the definition of the
universe for that facet to a particular condition, or to a small set of
conditions.

On the other hand, if an attribute which was expected to vary over
different conditions of a particular kind is found to be invariant over these
conditions, the universe of generalization for the attribute may be extended
to include this kind of condition as a new facet.

In most cases, decisions about whether or not to generalize over a
particular class of conditions will depend, in part, on whether observations
are invariant over the conditions. If the observations are not invariant over
the class of conditions, including these conditions as a facet in the universe
of generalization would decrease the validity of measurements of the attribute;
therefore, these conditions are not likely to be defined as a facet of the
universe of generalization. However, if the observations are at least
approximately invariant over the class of conditions, broadening the definition
of the attribute to include these conditions as a facet would not decrease
validity, and would increase the usefulness of the measurements.



THE "STEADY STATE" REQUIREMENT

Cronbach et al. (1972) raise an issue which they treat as a limitation of
generalizability theory:

    "Because our model treats conditions within a facet as
    unordered it will not deal adequately with the stability of
    scores that are subject to trends . . .
    arising from the measurement process, . . . a large contribution
    will be made by the development of a model for treating
    ordered facets . . ." (p. 364)

It is clearly inappropriate to consider consecutive observations as
randomly sampled from a universe of generalization if these observations are
known to depend systematically on time or on any other facet in the universe.
However, this should not be viewed as a limitation in a theory of measurement.
Accepting that the theory of measurement is intended to analyze the methods
rather than the substantive content of science, there is no need for it to
cover functional relationships among different variables.

According to the domain sampling model proposed here, to generalize over a
facet is to treat the variability of observed scores due to the sampling of the
facet as error. Where the conditions of some kind are treated as a facet, the
observed score is interpreted as an estimate of the mean over all conditions of
the facet, and the observed score is not associated with a particular condition
of the facet. On the other hand, in order to recognize a relationship between
observed scores and the conditions of a facet, each observed score must be
associated with a particular condition of the facet, and this implies the
absence of generalization over the facet. Therefore, within this model, it is
inconsistent to say that a particular kind of condition should be included as
a facet in the universe of generalization, and at the same time to say that
observed scores are a function of the facet.

The problems introduced into sampling models by the existence of trends can
be eliminated as soon as the trend is detected. This is accomplished by
restricting the universe of generalization for each observation to a fixed
condition of the facet involved, and by treating the trend as an empirical law
(see section VI). Undetected trends will tend to cause the variance components
for the facet to be large, and therefore the examination of variance components
can facilitate the detection of trends.
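
The effect of an undetected trend on a variance component can be seen in a
minimal sketch. It assumes a crossed persons by occasions design with one
observation per cell, noise-free hypothetical scores, and a linear trend over
occasions; the function names and the numbers are illustrative only.

```python
def occasion_variance_component(scores):
    """Estimate the occasion main-effect variance component for a crossed
    persons x occasions design with one observation per cell.

    scores[p][t] is the observed score of person p on occasion t.
    """
    n_p = len(scores)
    n_t = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_t)
    person_means = [sum(row) / n_t for row in scores]
    occ_means = [sum(scores[p][t] for p in range(n_p)) / n_p
                 for t in range(n_t)]

    ms_occ = n_p * sum((m - grand) ** 2 for m in occ_means) / (n_t - 1)
    ss_res = sum(
        (scores[p][t] - person_means[p] - occ_means[t] + grand) ** 2
        for p in range(n_p)
        for t in range(n_t)
    )
    ms_res = ss_res / ((n_p - 1) * (n_t - 1))
    # Negative estimates are conventionally set to zero.
    return max((ms_occ - ms_res) / n_p, 0.0)


person_effects = [0.0, 2.0, 4.0]   # hypothetical universe scores
occasions = [0, 1, 2, 3]           # consecutive, ordered occasions


def simulate(slope):
    """Scores equal to the person effect plus a linear trend over occasions."""
    return [[p + slope * t for t in occasions] for p in person_effects]


no_trend = occasion_variance_component(simulate(0.0))
with_trend = occasion_variance_component(simulate(1.0))

print("occasion component without a trend:", no_trend)
print("occasion component with a trend:", with_trend)
```

With no trend the occasion component is zero, while the linear trend alone
produces a clearly nonzero component, which is the kind of signal that would
prompt a closer look at the ordering of the conditions.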

CONCLUDING COMMENTS

The sampling model provides a framework for considering the issues that
arise naturally in the interpretation of measurements in terms of dispositions.
The three types of issues that have been identified are those associated with
reliability, validity, and import. Reliability indicates how well observed
scores represent the universe of allowable observations, validity indicates how
well observed scores represent the universe of generalization defining an
attribute, and import indicates how well the observed score predicts other
observed scores that are of interest.

Since measurable attributes in both the physical and behavioral sciences
are interpreted as dispositions, these issues arise for every measurement
procedure. However, the way in which these issues should be analyzed is not
fixed. For convenience, most of the discussion in this paper has been in
terms of variance components and generalizability coefficients. The same
points could have been made in terms of correlation coefficients. In fact,
where one is interested in the relationship between the observed scores for
particular conditions of a facet and universe scores, it would be natural to
use correlation coefficients. In dealing with categorical data, a rather
different set of indices would need to be estimated.



Although variance components seem to match the assumptions of the sampling
model for dispositions especially well, the formal statistical models defining
variance components should not be allowed to obscure the fundamental concerns
embodied in the invariance properties. As Cronbach (1976) has observed, the
technical apparatus of generalizability theory is less important than the
questions suggested by the theory.

The sampling model has a number of advantages. It is formulated in terms
of the fundamental concept of sampling, and the model is basically quite
simple. An attribute is defined in terms of a universe of generalization, and
the universe score for the attribute is simply the expected value over the
universe of generalization.

If one wishes to assume that the invariance properties associated with a
universe of generalization reflect some underlying structure or process
associated with the attribute, one is free to do so. However, the definition
of the universe of generalization can also be treated as a matter of
convenience or convention. The sampling model is consistent with either of
these two points of view.

Although the sampling model makes few assumptions, it provides an analysis
of many issues associated with the dependability of measurement. It makes it
possible to give validity a straightforward interpretation, and to draw a clear
distinction between reliability and validity. The conclusions that reliability
is an upper bound on validity and that some means of improving reliability may
cause validity to decrease can be easily derived from the model. Furthermore,
the model provides a basis for a detailed analysis of standardization and of
the resulting systematic errors. Convergent validity can be analyzed in terms
of the standardization of a method facet.

As shown in section VI, the model suggests an explicit mechanism for
relating the refinement of measurement to the development of laws.
Although the analysis of this mechanism has not been carried very far, it does
begin to clarify the relationship between theory and measurement.

The problems associated with sampling models are no more serious than the
problems associated with other models. Rather, these problems are noticed more
clearly because the assumptions of sampling models are more clearly stated than
for other models.







Bibliography - Validity



American Psychological Association. Standards for educational and psychological
    tests (Rev. ed.). Washington, D.C.: American Psychological Association,
    1974.

Bechtoldt, H. P. Construct validity: A critique. American Psychologist,
    14, 1959, 619-629.

Braithwaite, Richard B. Scientific Explanation. Cambridge: Cambridge
    University Press, 1953.

Brennan, R. L. Generalizability analysis: principles and procedures. ACT
    Technical Bulletin no. 26. Iowa City, Iowa: American College Testing
    Program, 1977.

Brennan, R. L. and Kane, M. T. Generalizability theory: a review. New
    Directions for Testing and Measurement, 1979, 33-51.

Bridgman, Percy W. The Logic of Modern Physics. New York: Macmillan, 1927.

Campbell, D. T., and Fiske, D. W. Convergent and discriminant validation by
    the multitrait-multimethod matrix. Psychological Bulletin, 1959, 81-105.

Campbell, D. T. & Tyler, B. B. The construct validity of work-group morale
    measures. Journal of Applied Psychology, 1957, 41, 91-92.

Campbell, Norman R. What is Science? London: Methuen, 1921.

Campbell, Norman R. Physics: The Elements. Cambridge: Cambridge University
    Press, 1920.



Cardinet, J., Tourneur, Y., and Allal, L. The symmetry of generalizability
    theory: applications to educational measurement. Journal of Educational
    Measurement, 1976, 13, 119-134.

Carnap, R. Testability and meaning. In H. Feigl and M. Brodbeck (Eds.).
    Readings in the Philosophy of Science. New York, Appleton-Century-Crofts,
    1953, 47-92.

Corben, H. & Stehle, P. Classical Mechanics, Second Edition. New York: John
    Wiley & Sons, Inc., 1960.

Cornfield, J. & Tukey, J. W. Average values of mean squares in factorials.
    Annals of Mathematical Statistics, 1956, 27, 907-949.

Cronbach, L. J. On the design of educational measures. In D. N. M. de Gruijter
    and L. J. T. van der Kamp (Eds.). Advances in Psychological and
    Educational Measurement. New York, John Wiley and Sons, 1976.

Cronbach, L. J. Test validation. In R. L. Thorndike (Ed.). Educational
    Measurement, 1971, pp. 443-507.

Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. The dependability
    of behavioral measurements: Theory of generalizability for scores and
    profiles. New York: Wiley, 1972.

Cronbach, L. J., & Gleser, G. C. The signal/noise ratio in the comparison of
    reliability coefficients. Educational and Psychological Measurement, 1964,
    24, 467-480.

Cronbach, L. J., and Meehl, P. E. Construct validity in psychological tests.
    Psychological Bulletin, 52, 1955, 281-302.

Ebel, R. L. Explorations in reliability theory. Contemporary Psychology,
    1974, 19, 81-83.



Ennis, R. H. Operational definitions. In H. S. Broudy, R. H. Ennis, and L. I.
    Krimerman (Eds.). Philosophy of Educational Research. New York, John
    Wiley and Sons, 1973, 650-669.

Fiske, D. W. Can a personality construct be validated empirically?
    Psychological Bulletin, 1973, 80, 89-92.

Frank, P. Philosophical interpretations and misinterpretations of the theory
    of relativity. In H. Feigl and M. Brodbeck (Eds.). Readings in the
    Philosophy of Science. New York, Appleton-Century-Crofts, 1953, 213-231.

Frank, Philipp. Philosophy of Science. Englewood Cliffs, N.J.: Prentice-Hall,
    1957.

Gillmore, G. M. An introduction to generalizability theory as a contributor to
    evaluation research. EAC Report No. 79-14. Seattle, Educational Assessment
    Center, University of Washington, 1979.

Hempel, C. G. Operationism, observation, and theoretical terms. In A. Danto
    and S. Morgenbesser (Eds.). Philosophy of Science. New York, New
    American Library, Inc., 1960, 101-120.

Hempel, Carl G. Aspects of Scientific Explanation and Other Essays in the
    Philosophy of Science. Glencoe, Ill.: Free Press, 1965.

Hempel, C. G. Fundamentals of Concept Formation in Empirical Science.
    Chicago: The University of Chicago Press, 1957.

Hively, W., Patterson, H. L., & Page, S. A. A "universe-defined" system of
    arithmetic achievement tests. Journal of Educational Measurement, 1968,
    5, 275-290.

Kirk, R. E. Experimental design: Procedures for the behavioral sciences.
    Belmont, Cal.: Wadsworth, 1968.


Kuhn, Thomas S. The Structure of Scientific Revolutions, Second Edition.
    International Encyclopedia of Unified Science, Vol. 2, No. 2. Chicago,
    University of Chicago Press, 1970.

Lindquist, E. F. Design and analysis of experiments in psychology and
    education. Boston, Houghton Mifflin, 1953.

Loevinger, J. Person and population as psychometric concepts. Psychological
    Review, 72, 1965, 143-155.

Lord, F. M., & Novick, M. R. Statistical theories of mental test scores.
    Reading, Mass.: Addison-Wesley, 1968.

MacCorquodale, Kenneth and Meehl, Paul E. On a Distinction between
    Hypothetical Constructs and Intervening Variables. Psychological Review,
    55, 1948, 95-107.

McDonald, R. P. Generalizability in factorable domains: "domain validity and
    generalizability." Educational and Psychological Measurement, 38, 1978,
    75-79.




Meehl, P. E. On the circularity of the law of effect. Psychological Bulletin,
    1950, 47.

Nunnally, J. C. Psychometric Theory. New York, McGraw Hill, 1967.

Osburn, H. G. Item sampling for achievement testing. Educational and
    Psychological Measurement, 1968, 28, 95-104.

Pap, A. The a priori in physical theories. New York, Russell & Russell, 1946.

Physical Science Study Committee. College Physics. Uri Haber-Schaim (ed.).
    Raytheon Education Company, 1968.

Popham, W. J. Criterion-referenced measurement. Englewood Cliffs, N.J.,
    Prentice-Hall, 1978.

Popper, Karl R. The Logic of Scientific Discovery. New York, Harper
    and Row, 1965.

Popper, Karl R. Conjectures and Refutations: The Growth of Scientific
    Knowledge. New York, Harper and Row, 1968.

Rajaratnam, N., Cronbach, L. J., and Gleser, G. C. Generalizability of
    Stratified-Parallel Tests. Psychometrika, 1965, 30, 39-56.

Rozeboom, W. Foundations of the theory of prediction. Dorsey Press,
    Homewood, Ill., 1966.

Smith, P. L. Sampling Errors of Variance Components in Small Sample Multifacet
    Generalizability Studies. Journal of Educational Statistics, 3, 1978,
    319-346.

Suppes, Patrick. The Structure of Theories and the Analysis of Data. In W. F.
    Suppe (Ed.) The Structure of Scientific Theories. Urbana, Ill.: The
    University of Illinois Press, 1974.

Suppes, P., and Zinnes, J. L. Basic measurement theory. In R. D. Luce, R. R.
    Bush, and E. Galanter (Eds.). Handbook of Mathematical Psychology, Vol. 1.
    New York, John Wiley and Sons, 1963.

Torgerson, W. S. Theory and methods of scaling. New York, John Wiley & Sons,
    1958.

Toulmin, Stephen. The Philosophy of Science. London: Hutchinson's University
    Library, 1953.

Tryon, R. C. Reliability and behavior domain validity: reformulation and
    historical critique. Psychological Bulletin, 54, 1957, 229-249.

Winer, B. J. Statistical principles in experimental design (2nd ed.).
    New York: McGraw-Hill, 1971.

Wright, B. D. & Douglas, G. A. Best procedures for sample-free item analysis.
    Applied Psychological Measurement, 1, 1977, 281-295.




